# Feature Engineering and Extraction Techniques

Feature engineering and extraction are crucial steps in the data preprocessing phase of any data science or machine learning project. These techniques help in creating new features or extracting relevant information from raw data, which can significantly improve the performance of machine learning models. 

Feature engineering focuses on creating new features or modifying existing ones to improve predictive model performance. On the other hand, feature extraction involves reducing the dimensionality of the dataset while retaining relevant information.

**Objectives**
* Enhance the predictive power of machine learning models
* Reduce overfitting and improve model generalization
* Capture relevant information from the data
* Improve interpretability of the models

**Use Cases**
* Predictive modeling
* Pattern recognition
* Anomaly detection
* Recommendation systems
* Natural language processing
* Computer vision
* Time series analysis

**Examples of Datasets**
* Titanic dataset (for classification)
* Boston housing dataset (for regression)
* Iris dataset (for classification)
* MNIST dataset (for image classification)
* Text datasets (e.g., IMDb movie reviews)

# Feature Engineering Techniques:
## Handling Missing Values:

Missing values, often represented by NaN (Not a Number) or null, are a common challenge in data analysis. Here's a breakdown of techniques to handle them:

1. **Deletion:**
The simplest approach: remove rows or columns that contain missing values.

**Use Cases:**
* Small amount of missing data.
* Missingness is random (MCAR - Missing Completely At Random).

**Disadvantages:**
* Loses potentially valuable information.
* Can bias results if missingness is not random (MNAR - Missing Not At Random).
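
A minimal sketch of deletion with pandas, assuming a small made-up DataFrame (the column names `age`, `income`, and `mostly_empty` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with a few missing entries
df = pd.DataFrame({
    "age": [25, np.nan, 40, 31],
    "income": [50000, 62000, np.nan, 48000],
    "mostly_empty": [np.nan, np.nan, np.nan, 1.0],
})

# Keep only columns with at least half of their values present
df_cols = df.dropna(axis=1, thresh=len(df) // 2)

# Then drop any row that still contains a missing value
df_rows = df_cols.dropna(axis=0)

print(df_rows)
```

Dropping sparse columns first usually preserves more rows than dropping rows straight away.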

2. **Imputation:**
Replaces missing values with estimated values.

**Use Cases:**
* When deletion is not feasible due to data loss.
* Different imputation techniques work for different data types and missingness patterns.

**Types of Imputation**:

* Mean/Median/Mode Imputation: Replaces missing values with the mean or median of the feature (numerical features) or its most frequent value (categorical features).
* Random Sample Imputation: Fills missing values with a random value from existing data points in the same feature.
* K-Nearest Neighbors (KNN) Imputation: Uses the values of the k nearest neighbors (data points most similar to the one with missing value) to estimate the missing value.
* Model-based Imputation: Uses a machine learning model to predict missing values based on other features.
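
The imputation types above can be sketched as follows. This is an illustration on made-up data, not a recommended pipeline; the column names are hypothetical, and scikit-learn's `KNNImputer` is used for the KNN case.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data
df = pd.DataFrame({
    "height": [150.0, 160.0, np.nan, 180.0],
    "weight": [50.0, 60.0, 70.0, np.nan],
    "age": [20.0, 30.0, 40.0, 50.0],
    "city": ["Paris", None, "Paris", "Lyon"],
})

# Median imputation for a numerical feature
height_median = df["height"].fillna(df["height"].median())

# Mode (most frequent value) imputation for a categorical feature
city_mode = df["city"].fillna(df["city"].mode()[0])

# KNN imputation: each missing value becomes the mean of that feature
# over the k nearest rows, with distances computed on the observed features
knn = KNNImputer(n_neighbors=2)
numeric_imputed = knn.fit_transform(df[["height", "weight", "age"]])
```

In a real project the imputer would be fitted on the training split only and then applied to the validation and test splits.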

3. **Feature Engineering with Missingness**:

* Objective: Utilizes missingness information to create new features.
* Use Cases: When missingness itself holds meaning (e.g., income not reported could indicate a certain income bracket).
* Example: Create a new binary feature "income_missing" to indicate if income data is missing.
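
A short sketch of the "income_missing" indicator, assuming a toy `income` column. The median fill afterwards is only an example of a follow-up imputation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"income": [42000.0, np.nan, 58000.0, np.nan]})

# Record missingness before imputing, so the signal is not lost
df["income_missing"] = df["income"].isna().astype(int)

# Impute afterwards (median is used here only as an example)
df["income"] = df["income"].fillna(df["income"].median())
```

The indicator must be created before imputation, because afterwards the filled rows are indistinguishable from observed ones.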

4. **Ignoring Missing Values**:

* Objective: Leave missing values in place and rely on an algorithm that handles them natively.
* Use Cases: Limited, only for algorithms specifically designed to work with missing data (e.g., some decision-tree and gradient-boosting implementations with built-in handling mechanisms).
* Disadvantages: Not generally recommended. Most algorithms either fail on missing values or can produce biased results.

5. **Forward filling**: Propagate the last observed value forward along the sequence of observations to fill missing values.

**Use Cases:**
* Time series data where missing values represent temporary interruptions or gaps in data collection.
* Repeated surveys or questionnaires where a respondent skips a question in one round and the previous answer is a reasonable stand-in.

6. **Backward filling**: Propagate the next observed value backward along the sequence of observations to fill missing values. Because it uses later observations, it is unsuitable for features that feed a forecasting model.

**Use Cases:**
* Time series data where missing values represent temporary interruptions or gaps in data collection.
* Repeated surveys or questionnaires where a respondent skips a question in one round and the next answer is a reasonable stand-in.
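
Forward and backward filling can be sketched with pandas as follows. The daily sensor readings are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical daily sensor readings with gaps
readings = pd.Series(
    [20.1, np.nan, np.nan, 21.4, np.nan],
    index=pd.date_range("2024-01-01", periods=5, freq="D"),
)

forward = readings.ffill()   # carry the last observed value forward
backward = readings.bfill()  # pull the next observed value backward

# limit caps how many consecutive missing values are filled
forward_limited = readings.ffill(limit=1)
```

Backward filling leaves the final gap empty here because no later observation exists, and `limit=1` leaves the second value of the two-day gap unfilled.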

7. **Interpolation**:  Estimates missing values by using the values of surrounding data points. Preserves data and avoids deletion.

**Types of Interpolation**:

* Linear Interpolation: Fills each missing value from the straight line connecting the nearest existing values on either side; for a single gap between equally spaced points this is their average.
* Polynomial Interpolation: Fits a polynomial function through the surrounding data points and uses it to estimate the missing value.
* Spline Interpolation: Uses piecewise polynomial functions to create a smoother fit than linear interpolation.
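
A small sketch of linear interpolation with pandas on a made-up series. Note that the two-value gap is filled with points on the line between 1 and 4, not with their average.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan, 8.0])

# Linear interpolation treats the values as equally spaced and
# places each missing point on the straight line between its neighbors
linear = s.interpolate(method="linear")
print(linear.tolist())  # [1.0, 2.0, 3.0, 4.0, 6.0, 8.0]
```

pandas also accepts `method="polynomial"` and `method="spline"` together with an `order` argument; those methods require SciPy to be installed.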

**Suitability:**

* Interpolation works best for continuous features with a predictable missingness pattern.
* For categorical features or random missingness, it might introduce bias by assuming a specific relationship between data points.

**The best approach depends on:**

* Amount of missing data: Deletion might be acceptable for small amounts.
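* Missingness pattern: Random missingness (MCAR) allows for simpler techniques like deletion or mean imputation. When missingness depends on other observed features (MAR - Missing At Random), KNN or model-based imputation can exploit those features. Non-random missingness (MNAR) is the hardest case: no imputation method fully corrects it, and a missingness indicator or domain knowledge about why values are missing is usually needed.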
* Data type: Imputation techniques differ for numerical and categorical features.
* Domain knowledge: Understanding the data and reasons for missingness can guide the choice of technique.

**Remember:** There's no one-size-fits-all solution. Experiment with different techniques and evaluate their impact on model performance.

##  Encoding Categorical Variables:

Encoding converts categorical variables into a numerical format, which most machine learning algorithms require as input. The choice of encoding matters: an arbitrary integer code can imply an order or distance between categories that does not exist.

**Use Cases:**

* Text data such as movie genres, product categories, or city names.
* Nominal data such as gender or marital status, and ordinal data such as education level.

**Examples:**

* One-hot encoding: Convert categorical variables into binary vectors representing the presence or absence of each category.
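* Label encoding: Map each category to an integer. The integers imply an order, so this suits ordinal features or tree-based models better than linear or distance-based ones.
* Target encoding: Replace each category with a statistic of the target variable (typically its mean) computed over the rows in that category.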

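In [None]:
import pandas as pd

# One-hot encoding: one binary column per category
encoded_df = pd.get_dummies(df, columns=['categorical_column'])

# Label encoding: map each category to an integer
# (LabelEncoder is designed for target labels; OrdinalEncoder is the feature-side equivalent)
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['categorical_column_label'] = le.fit_transform(df['categorical_column'])

# Target encoding: fit on the features only, so the target is not passed in as an input column
import category_encoders as ce
target_encoder = ce.TargetEncoder(cols=['categorical_column'])
X = df.drop(columns=['target_column'])
encoded_df = target_encoder.fit_transform(X, df['target_column'])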
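
To make the idea behind target encoding concrete, here is a hand-rolled sketch with pandas. It uses a simple smoothed mean with a hypothetical weight `m`; this is an illustration only and not necessarily the formula a given library applies.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({
    "city": ["A", "A", "B", "B", "B", "C"],
    "target": [1, 0, 1, 1, 0, 1],
})

global_mean = df["target"].mean()
stats = df.groupby("city")["target"].agg(["mean", "count"])

# Smoothed mean: shrink each category mean toward the global mean,
# more strongly for categories with few rows (m is a hypothetical weight)
m = 2
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

df["city_encoded"] = df["city"].map(smoothed)
```

Because the encoding is computed from the target, it should be fitted on training data only (ideally within cross-validation folds) to avoid leaking the target into the features.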


## Data Transformation:

Used to transform features into a format better suited to modeling or analysis. Many algorithms are sensitive to feature scale or distribution shape (for example distance-based and gradient-based methods), and transformation brings the data closer to what they assume.

**Use Cases:**

* Scaling numerical features to a common range to prevent features with large values from dominating the model.
* Standardizing features so that they have a mean of zero and a standard deviation of one.
* Log transformation to stabilize variance and make the data distribution more symmetrical.

**Examples:**

* Min-max scaling: Scale numerical features to a specified range (e.g., [0, 1]).
* Standardization: Transform features to have a mean of zero and a standard deviation of one.
* Log transformation: Apply a logarithmic function to compress large values, stabilize variance, and reduce right skew. It requires positive values; log(1 + x) is a common variant when zeros are present.

In [None]:
# Min-max scaling
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(df[['feature1', 'feature2']])

# Standardization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
standardized_features = scaler.fit_transform(df[['feature1', 'feature2']])

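# Log transformation
# log1p computes log(1 + x), so zeros are handled; plain np.log requires strictly positive values
import numpy as np
df['log_feature'] = np.log1p(df['feature'])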

## Handling Outliers:

Outlier handling identifies extreme values and mitigates their impact on model performance, so that they do not unduly influence the results of statistical analyses or machine learning algorithms.

**Use Cases:**

* Financial data where extreme values may represent anomalies or errors.
* Healthcare data where outliers may indicate unusual patient conditions or measurement errors.

**Examples:**

* Clipping: Limit the range of values by setting a minimum and maximum threshold.
* Winsorization: Replace values beyond chosen percentiles (e.g., the 5th and 95th) with the values at those percentiles.
* Box-Cox transformation: Apply a power transformation that makes strictly positive data more nearly normal, which reduces the impact of outliers.

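In [None]:
# Clipping: the thresholds here are taken from the 1st and 99th percentiles as an example
min_threshold = df['feature'].quantile(0.01)
max_threshold = df['feature'].quantile(0.99)
df['clipped_feature'] = df['feature'].clip(lower=min_threshold, upper=max_threshold)

# Winsorization: cap the lowest and highest 5% of values
# (winsorize returns a masked array, so convert it back to a plain array)
import numpy as np
from scipy.stats.mstats import winsorize
df['winsorized_feature'] = np.asarray(winsorize(df['feature'].to_numpy(), limits=(0.05, 0.05)))

# Box-Cox transformation (requires strictly positive values)
from scipy.stats import boxcox
transformed_feature, fitted_lambda = boxcox(df['feature'])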
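
For the identification step, one common rule of thumb (not listed among the examples above) is Tukey's interquartile-range rule. A minimal sketch on made-up values:

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10, 12, 11, 13, 12, 95, 11, 10])

q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag points outside the fences, then cap them at the fences
is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)
```

The 1.5 multiplier is a convention rather than a law; whether a flagged point is an error or a genuine extreme still needs domain judgment.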


## Feature Extraction:

Feature extraction creates new features from existing ones to capture additional information or patterns in the data, and can reduce the dimensionality of the dataset while preserving important information.

**Use Cases:**
* Natural language processing (NLP) tasks where text features need to be converted into numerical representations.
* Image processing tasks where raw pixel values are transformed into higher-level features.
* Time series analysis where raw time series data is transformed into features that capture trends, seasonality, and other patterns.

**Examples:**

* Polynomial features: Generate powers of each feature and interaction terms between features up to a specified degree.
* Interaction features: Combine two or more features to capture relationships between them.
* Time series features: Extract statistical measures, frequency domain features, or lagged values from time series data.

In [None]:
# Polynomial features
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2)
poly_features = poly.fit_transform(df[['feature1', 'feature2']])

# Interaction features
df['interaction_feature'] = df['feature1'] * df['feature2']

# Time series features
# Example: Extracting lagged values
df['lag_1'] = df['feature'].shift(1)
df['lag_2'] = df['feature'].shift(2)
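
Beyond lagged values, rolling statistics and calendar fields are common time series features. The sketch below assumes a made-up daily series; the column names are hypothetical.

```python
import pandas as pd

# Hypothetical daily series
ts = pd.DataFrame(
    {"feature": [3.0, 5.0, 4.0, 6.0, 8.0, 7.0]},
    index=pd.date_range("2024-01-01", periods=6, freq="D"),
)

# Shift by one step first so each row only sees earlier observations
past = ts["feature"].shift(1)
ts["rolling_mean_3"] = past.rolling(window=3).mean()
ts["rolling_std_3"] = past.rolling(window=3).std()

# Calendar feature derived from the timestamp index (Monday = 0)
ts["day_of_week"] = ts.index.dayofweek
```

The first three rows of the rolling columns are NaN because a full window of past values is not yet available; they are usually dropped or imputed before modeling.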

## Feature Selection:

Feature selection identifies the most relevant features for modeling and analysis. Removing irrelevant or redundant features reduces overfitting and improves model generalization.

**Use Cases:**

* High-dimensional datasets where not all features contribute equally to the outcome.
* Models where feature selection can help improve performance and interpretability.

**Examples:**

* Recursive feature elimination (RFE): Repeatedly fit a model and remove the least important features, as ranked by its coefficients or feature importances, until the desired number remains.
* Lasso regression: Regularize the model to encourage sparsity and automatically select features.
* Tree-based feature importance: Use decision trees to determine the importance of each feature in predicting the target variable.

In [None]:
# X is assumed to be a 2D NumPy array of features and y the target vector
# Recursive feature elimination (RFE)
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)
selected_features = X[:, selector.support_]

# Lasso regression
from sklearn.linear_model import LassoCV
lasso = LassoCV()
lasso.fit(X, y)
selected_features = X[:, lasso.coef_ != 0]

# Tree-based feature importance
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()
rf.fit(X, y)
importance = rf.feature_importances_

## Dimensionality Reduction:

**Objectives:**

* To reduce the number of features in the dataset while preserving as much information as possible.
* To alleviate the curse of dimensionality and improve model performance and interpretability.

**Use Cases:**

* High-dimensional datasets where the number of features exceeds the number of observations.
* Models that suffer from multicollinearity or overfitting due to a large number of features.

**Examples:**

* Principal Component Analysis (PCA): Transform the original features into a new set of orthogonal components that capture the maximum variance.
* Singular Value Decomposition (SVD): A matrix factorization method that decomposes a matrix into three constituent matrices to capture latent features.
* t-distributed Stochastic Neighbor Embedding (t-SNE): Reduce dimensionality while preserving the local structure of the data points.
* Uniform Manifold Approximation and Projection (UMAP): Non-linear dimensionality reduction technique that preserves local structure and often retains more of the global structure than t-SNE.

In [None]:
# Principal Component Analysis (PCA)
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
principal_components = pca.fit_transform(X)

# Singular Value Decomposition (SVD)
from sklearn.decomposition import TruncatedSVD
svd = TruncatedSVD(n_components=2)
svd_features = svd.fit_transform(X)


# t-distributed Stochastic Neighbor Embedding (t-SNE)
from sklearn.manifold import TSNE
tsne = TSNE(n_components=2)
embedded_features = tsne.fit_transform(X)

# Uniform Manifold Approximation and Projection (UMAP)
import umap
reducer = umap.UMAP(n_components=2)
umap_features = reducer.fit_transform(X)
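
The link between PCA and SVD can be sketched directly in NumPy: PCA amounts to an SVD of the mean-centered data matrix. The data below is randomly generated for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 100 samples, 3 features, two of them strongly correlated
base = rng.normal(size=(100, 1))
X = np.hstack([
    base,
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# PCA = SVD of the mean-centered data matrix
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

explained_variance_ratio = S**2 / np.sum(S**2)
components = Vt[:2]                    # top-2 principal directions
X_reduced = X_centered @ components.T  # projected data, shape (100, 2)
```

`explained_variance_ratio` shows how much variance each component captures, which is the usual basis for choosing `n_components`.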

## Clustering Features:

**Objectives:**

* To group similar observations into clusters based on their feature similarity.
* To identify patterns and structure within the dataset.

**Use Cases:**

* Unsupervised learning tasks where the target variable is not available.
* Data exploration and segmentation tasks to understand the underlying structure of the data.

**Examples:**

* K-means clustering: Partition the dataset into k clusters based on feature similarity.
* DBSCAN (Density-Based Spatial Clustering of Applications with Noise): Identify clusters of varying shapes and sizes based on density.
* Hierarchical clustering: Build a tree of clusters where the similarity between clusters is determined by the distance between observations.

In [None]:
# K-means clustering
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3)
clusters = kmeans.fit_predict(X)

# DBSCAN
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
clusters = dbscan.fit_predict(X)

# Hierarchical clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=3)
clusters = agg.fit_predict(X)
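
Cluster assignments can themselves be fed to a downstream model as features. A minimal sketch, assuming two synthetic blobs and scikit-learn's `KMeans`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two hypothetical, well-separated blobs in 2D
X = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(50, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(50, 2)),
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

cluster_id = kmeans.labels_          # categorical feature: assigned cluster
centroid_dist = kmeans.transform(X)  # numeric features: distance to each centroid

# Append the distance features to the original feature matrix
X_augmented = np.hstack([X, centroid_dist])
```

Distances to centroids are often more informative than the hard cluster label, because they keep information about how close a point is to each group.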

## Graph-based Features:

**Objectives:**

* To extract features from graph-structured data such as social networks, recommendation systems, and biological networks.
* To capture the relationships and interactions between nodes in the graph.

**Use Cases:**

* Social network analysis to identify influential nodes or communities.
* Recommendation systems to model user-item interactions and similarity.
* Biological network analysis to understand protein-protein interactions and gene regulatory networks.

**Examples:**

* Graph Embeddings: Learn low-dimensional representations of nodes in a graph that capture structural information and node similarity.
* Graph Kernels: Compute similarity measures between pairs of graphs based on common substructures or node features.
* Structural Features: Extract features such as node degree, clustering coefficient, and centrality measures to characterize the network topology.

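In [None]:
# Graph Embeddings (Node2Vec with StellarGraph; this follows the library's Keras Node2Vec
# workflow for StellarGraph 1.2 with TensorFlow 2, and other versions may differ)
from tensorflow import keras
from stellargraph import datasets
from stellargraph.data import BiasedRandomWalk, UnsupervisedSampler
from stellargraph.mapper import Node2VecLinkGenerator, Node2VecNodeGenerator
from stellargraph.layer import Node2Vec, link_classification

# Load a graph dataset (load() returns the graph and the node labels)
dataset = datasets.Cora()
graph, node_subjects = dataset.load(largest_connected_component_only=True)

# Define the biased random walk generator
walker = BiasedRandomWalk(graph, n=1, length=10, p=1, q=1)

# Sample (target, context) node pairs from the walks
unsupervised_samples = UnsupervisedSampler(graph, nodes=list(graph.nodes()), walker=walker)

# Define the training generator and the Node2Vec layer (128-dimensional embeddings)
batch_size = 50
generator = Node2VecLinkGenerator(graph, batch_size)
node2vec = Node2Vec(128, generator=generator)

# Build and train a link-classification model on the sampled pairs
x_inp, x_out = node2vec.in_out_tensors()
prediction = link_classification(output_dim=1, output_act='sigmoid', edge_embedding_method='dot')(x_out)
model = keras.Model(inputs=x_inp, outputs=prediction)
model.compile(optimizer='adam', loss='binary_crossentropy')
model.fit(generator.flow(unsupervised_samples), epochs=2, verbose=1)

# Embed nodes with the trained target-node branch and extract features
embedding_model = keras.Model(inputs=x_inp[0], outputs=x_out[0])
node_gen = Node2VecNodeGenerator(graph, batch_size).flow(node_subjects.index)
features = embedding_model.predict(node_gen, verbose=1)

# Graph Kernels (GraKeL compares a collection of graphs, not nodes within a single graph)
from grakel.datasets import fetch_dataset
from grakel.kernels import WeisfeilerLehman, VertexHistogram

# Load a benchmark collection of labeled graphs (MUTAG is used here only as an example)
graphs = fetch_dataset('MUTAG', verbose=False).data

# Compute the pairwise graph kernel (similarity) matrix
wl_kernel = WeisfeilerLehman(n_iter=5, base_graph_kernel=VertexHistogram, normalize=True)
kernel_matrix = wl_kernel.fit_transform(graphs)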

## Structural Features:

**Objectives**:

* To capture the structural properties of objects represented as graphs.
* To quantify topological characteristics and relationships between nodes in the graph.

**Use Cases:**

* Network analysis to identify important nodes or communities.
* Graph classification and clustering tasks based on structural features.
* Recommendation systems and link prediction in graph-structured data.

**Examples:**

* Graph Degree: Measure the number of edges incident to a node, indicating its connectivity.
* Graph Clustering Coefficients: Quantify the degree to which nodes tend to cluster together.
* Betweenness Centrality: Identify nodes that act as bridges between different parts of the graph.

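In [None]:
import networkx as nx

# Example graph bundled with NetworkX, used here as a stand-in for your own nx.Graph
graph = nx.karate_club_graph()

# Graph Degree
degree = dict(graph.degree())

# Graph Clustering Coefficients
clustering_coefficients = nx.clustering(graph)

# Betweenness Centrality
betweenness_centrality = nx.betweenness_centrality(graph)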
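
To show what degree and the local clustering coefficient actually measure, here is a dependency-free sketch on a tiny made-up graph (the node names and edges are hypothetical):

```python
from itertools import combinations

# Hypothetical undirected graph as an edge list
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

# Build adjacency sets
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

# Degree: number of neighbors of each node
degree = {node: len(nbrs) for node, nbrs in adj.items()}

def local_clustering(node):
    """Fraction of pairs of neighbors that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for x, y in combinations(nbrs, 2) if y in adj[x])
    return 2 * links / (k * (k - 1))

clustering = {node: local_clustering(node) for node in adj}
```

Node `c` has three neighbors but only one connected pair among them, so its coefficient is 1/3, while `a` and `b` sit in a closed triangle and score 1.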


## Texture Features:

**Objectives:**

* To capture the surface properties and patterns in image data.
* To quantify the spatial arrangement of pixel intensities and their variations.

**Use Cases:**

* Texture classification and segmentation in medical imaging and satellite imagery.
* Material recognition and defect detection in manufacturing and quality control.
* Remote sensing and geospatial analysis for land cover classification and environmental monitoring.

**Examples:**

* Gray Level Co-occurrence Matrix (GLCM): Quantify the spatial relationships between pairs of pixel intensities.
* Gabor Filters: Extract texture features by convolving an image with a set of Gabor filter kernels.
* Local Binary Patterns (LBP): Encode local texture patterns based on the comparison of pixel values with neighboring pixels.

In [None]:
# Gray Level Co-occurrence Matrix (GLCM)
# `image` is assumed to be a 2D uint8 grayscale array
# (scikit-image < 0.19 spells these functions greycomatrix / greycoprops)
from skimage.feature import graycomatrix, graycoprops

# Compute GLCM
glcm = graycomatrix(image, distances=[1], angles=[0], levels=256, symmetric=True, normed=True)

# Extract GLCM properties
contrast = graycoprops(glcm, 'contrast')[0, 0]
homogeneity = graycoprops(glcm, 'homogeneity')[0, 0]

# Gabor Filters
from skimage.filters import gabor

# Compute Gabor features (gabor returns the real and imaginary filter responses)
gabor_real, gabor_imag = gabor(image, frequency=0.6)

# Local Binary Patterns (LBP)
from skimage.feature import local_binary_pattern

# Compute LBP
lbp = local_binary_pattern(image, P=8, R=1, method='uniform')
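
What a GLCM counts can be shown by hand in NumPy. The sketch below uses a made-up 4x4 image with four gray levels and a single offset (one pixel to the right); unlike a symmetric GLCM, it counts each pair in one direction only.

```python
import numpy as np

# Hypothetical 4x4 image with 4 gray levels (0..3)
image = np.array([
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
])
levels = 4

# Count how often gray level i has gray level j as its right-hand neighbor
glcm = np.zeros((levels, levels), dtype=float)
left, right = image[:, :-1].ravel(), image[:, 1:].ravel()
np.add.at(glcm, (left, right), 1)

glcm /= glcm.sum()  # normalize counts to joint probabilities

# Contrast: expected squared gray-level difference between neighbors
i, j = np.indices(glcm.shape)
contrast = np.sum(glcm * (i - j) ** 2)
```

A smooth texture concentrates mass on the diagonal of the matrix and gives low contrast; a busy texture spreads mass away from the diagonal.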

## Text Feature Extraction (NLP):

1. **Bag-of-Words (BoW)**:

* Overview: Represents text data based on the frequency of occurrence of words.
* Objectives: Convert text data into numerical form suitable for machine learning algorithms.
* Use Cases: Document classification, sentiment analysis.
* Examples: Document-term matrix.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

# Create bag-of-words model
vectorizer = CountVectorizer()
X_bow = vectorizer.fit_transform(corpus)
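
The document-term matrix behind bag-of-words can be sketched in plain Python. The three-sentence corpus is made up, and tokenization is a naive whitespace split.

```python
from collections import Counter

# Hypothetical toy corpus
corpus = ["the cat sat", "the cat sat on the mat", "dogs chase cats"]

tokenized = [doc.lower().split() for doc in corpus]
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term
doc_term_matrix = []
for doc in tokenized:
    counts = Counter(doc)
    doc_term_matrix.append([counts[term] for term in vocabulary])
```

Word order is discarded, and "cat" and "cats" end up in separate columns, which is why stemming or lemmatization is often applied first.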

2. **Term Frequency-Inverse Document Frequency (TF-IDF)**:

* Overview: Represents text data based on the importance of words in documents.
* Objectives: Capture the significance of words in a document relative to a corpus.
* Use Cases: Information retrieval, text classification.
* Examples: TF-IDF scores.

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Create TF-IDF model
vectorizer = TfidfVectorizer()
X_tfidf = vectorizer.fit_transform(corpus)
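
A sketch of the textbook TF-IDF weighting in plain Python, on a made-up corpus. scikit-learn's `TfidfVectorizer` uses a smoothed IDF and L2 normalization by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Hypothetical toy corpus
corpus = ["the cat sat", "the dog sat", "the bird flew"]
tokenized = [doc.split() for doc in corpus]
n_docs = len(tokenized)

# Document frequency: number of documents containing each term
doc_freq = Counter(term for doc in tokenized for term in set(doc))

def tfidf(doc):
    """Textbook TF-IDF: relative term frequency times log(N / df)."""
    counts = Counter(doc)
    return {
        term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
    }

weights = [tfidf(doc) for doc in tokenized]
```

"the" appears in every document and gets weight zero, while "cat", which appears in only one, gets the highest weight in the first document.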

3. **Word Embeddings (Word2Vec, GloVe)**:

* Overview: Represent words as dense vectors in a continuous vector space.
* Objectives: Capture semantic relationships between words.
* Use Cases: Text similarity, language translation, sentiment analysis.
* Examples: Word2Vec embeddings, GloVe embeddings.
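
How embeddings capture similarity can be sketched with cosine similarity. The vectors below are hand-written for illustration only; they are not real Word2Vec or GloVe output, which would be learned from a corpus.

```python
import numpy as np

# Hypothetical 4-dimensional embeddings, hand-written for illustration only;
# learned embeddings typically have tens to hundreds of dimensions
embeddings = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.4]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim_royal = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_fruit = cosine_similarity(embeddings["king"], embeddings["apple"])

# A simple document vector: the average of its word vectors
doc_vector = np.mean([embeddings[w] for w in ["king", "queen"]], axis=0)
```

Averaging word vectors is a common baseline for turning a variable-length text into a fixed-length feature vector.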

## Image Feature Extraction:

1. **Pixel Intensity Features**:

* Overview: Represents images based on the intensity values of pixels.
* Objectives: Capture low-level image characteristics.
* Use Cases: Image classification, edge detection.
* Examples: Mean pixel intensity, standard deviation of pixel intensity.

In [None]:
import numpy as np
from PIL import Image

# Load image
img = Image.open('image.jpg')

# Extract pixel intensity features
pixel_intensity = np.array(img).flatten()
mean_intensity = np.mean(pixel_intensity)
std_intensity = np.std(pixel_intensity)

2. **Histogram of Oriented Gradients (HOG)**:

* Overview: Represents the distribution of gradient orientations in an image.
* Objectives: Capture local object shape and structure.
* Use Cases: Object detection, pedestrian detection.
* Examples: HOG descriptors.

In [None]:
from skimage.feature import hog
from skimage import exposure

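# Extract HOG features
# `image` is assumed to be an RGB array; scikit-image < 0.19 uses multichannel=True instead of channel_axis=-1
fd, hog_image = hog(image, orientations=8, pixels_per_cell=(16, 16),
                    cells_per_block=(1, 1), visualize=True, channel_axis=-1)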

# Rescale histogram for better visualization
hog_image_rescaled = exposure.rescale_intensity(hog_image, in_range=(0, 10))

## Convolutional Neural Networks (CNNs)
* Overview: Deep learning models designed for processing and analyzing visual data.
* Objectives: Automatically learn hierarchical feature representations from images.
* Use Cases: Image classification, object detection, image segmentation.
* Examples: Various architectures like VGG, ResNet, Inception, etc.
* Code Example: Implementation using deep learning frameworks like TensorFlow or PyTorch.
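
As a framework-free illustration of what a single convolutional filter computes, here is a NumPy sketch. The kernel is set by hand; in a CNN such weights are learned from data, and, as in most deep learning frameworks, the operation is technically a cross-correlation.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel over the image (no padding, stride 1) and sum the products."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

# Hypothetical 5x5 image: dark left part, bright right part
image = np.array([[0, 0, 1, 1, 1]] * 5, dtype=float)

# Hand-set vertical-edge kernel; a CNN would learn such weights from data
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

# ReLU activation keeps only positive responses
feature_map = np.maximum(conv2d_valid(image, kernel), 0)
```

The feature map responds where the dark-to-bright edge lies and is zero over the uniform bright region. Stacking many such learned filters across layers is what yields the hierarchical features described above.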

## Deep Learning-based Feature Extraction:

**Objectives:**

* To automatically learn discriminative representations from raw data using deep neural networks.
* To capture hierarchical patterns and complex relationships in high-dimensional data.

**Use Cases:**

* Image classification and object detection in computer vision tasks.
* Natural language processing for sentiment analysis and language translation.
* Speech recognition and generation in audio processing applications.

**Examples:**

* Transfer Learning: Fine-tuning pre-trained deep learning models on specific tasks with limited annotated data.
* Pre-trained Models: Using existing deep learning architectures trained on large-scale datasets to extract features.
* Autoencoders: Unsupervised deep learning models that learn compact representations of input data by reconstructing it from a compressed latent space.

**Deep Learning Models (BERT, GPT, Transformers)**:

* Overview: State-of-the-art deep learning architectures for processing sequential data.
* Objectives: Capture contextual information and long-range dependencies in text.
* Use Cases: Natural language understanding, question answering, text generation.
* Examples: BERT, GPT-3, Transformer models. Usage requires pre-trained models and specific deep learning frameworks.

In [None]:
# Transfer Learning: use a pre-trained VGG16 convolutional base as a fixed feature extractor
from tensorflow.keras.applications import VGG16
from tensorflow.keras.preprocessing import image
from tensorflow.keras.applications.vgg16 import preprocess_input
import numpy as np

# Load pre-trained VGG16 model
model = VGG16(weights='imagenet', include_top=False)

# Load and preprocess image
img_path = 'image.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

# Extract features
features = model.predict(x)

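# Pre-trained Models
from tensorflow.keras.applications import ResNet50
from tensorflow.keras.applications.resnet50 import preprocess_input as resnet_preprocess_input

# Load pre-trained ResNet50 model
model = ResNet50(weights='imagenet', include_top=False)

# Extract features (preprocess the raw image array with ResNet50's own function)
x_resnet = resnet_preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))
features = model.predict(x_resnet)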

# Autoencoders
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model

# Define autoencoder architecture
input_img = Input(shape=(784,))
encoded = Dense(128, activation='relu')(input_img)
encoded = Dense(64, activation='relu')(encoded)
encoded = Dense(32, activation='relu')(encoded)
decoded = Dense(64, activation='relu')(encoded)
decoded = Dense(128, activation='relu')(decoded)
decoded = Dense(784, activation='sigmoid')(decoded)

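# Compile autoencoder model
autoencoder = Model(input_img, decoded)
autoencoder.compile(optimizer='adam', loss='binary_crossentropy')

# Train autoencoder
# (x_train / x_test are assumed to be flattened 784-dimensional inputs scaled to [0, 1])
autoencoder.fit(x_train, x_train, epochs=10, batch_size=256, shuffle=True, validation_data=(x_test, x_test))

# Extract features: the 32-dimensional bottleneck output of a separate encoder model
encoder = Model(input_img, encoded)
encoded_features = encoder.predict(x_test)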

**Dimensionality Reduction Techniques:**

* Principal Component Analysis (PCA)
* Singular Value Decomposition (SVD)
* t-distributed Stochastic Neighbor Embedding (t-SNE)
* Uniform Manifold Approximation and Projection (UMAP)
* Kernel Methods (Kernel PCA, Kernel SVM)
* Autoencoders for Unsupervised Feature Learning

**Image Feature Extraction:**

* Pixel Intensity Features
* Histogram of Oriented Gradients (HOG)
* Scale-Invariant Feature Transform (SIFT)
* Speeded Up Robust Features (SURF)
* Local Binary Patterns (LBP)
* Convolutional Neural Networks (CNNs)

**Speech Feature Extraction:**

* Mel-Frequency Cepstral Coefficients (MFCCs)
* Linear Predictive Coding (LPC)
* Perceptual Linear Predictive (PLP) Features
* Filter Bank Energies (FBE)
* Gammatone Filterbank Features
* Deep Learning Architectures (RNNs, CNNs)

**Text Feature Extraction (NLP):**

* Bag-of-Words (BoW)
* Term Frequency-Inverse Document Frequency (TF-IDF)
* Word Embeddings (Word2Vec, GloVe, FastText)
* Character-level Embeddings
* Part-of-Speech (POS) Tagging
* Named Entity Recognition (NER)
* Text Summarization Features
* Syntax Tree Features (Dependency Parsing)
* Deep Learning Models (BERT, GPT, Transformers)

**Predictive Modeling:**

* Statistical Features (Mean, Median, Standard Deviation, Skewness, Kurtosis)
* Frequency Domain Features (FFT, Power Spectral Density)
* Wavelet Transform Features
* Feature Scaling and Normalization
* Feature Selection (RFE, Lasso Regression, Tree-based Feature Importance)
* Feature Engineering (Polynomial Features, Interaction Terms, Time Series Features)

**Other Techniques:**

* Clustering Features (K-means, DBSCAN)
* Graph-based Features (Graph Embeddings, Graph Kernels)
* Structural Features (Graph Degree, Graph Clustering Coefficients)
* Shape Descriptors (Fourier Descriptors, Zernike Moments)
* Texture Features (Gray Level Co-occurrence Matrix, Gabor Filters)
* Deep Learning-based Feature Extraction (Transfer Learning, Pre-trained Models)