# Feature Engineering 
**It can be classified into 4 forms:**

**1. Feature Transformation:**

- a. Missing Value Imputation
    - Simple Imputation
    - Mean/Median/Mode Imputation
    - Forward/Backward Fill
    - Interpolation Methods (Linear, Polynomial, Spline)
    - K-Nearest Neighbors (KNN) Imputation
    - Predictive Modeling (e.g., Regression Models)

- b. Handling Categorical Features
    - One-Hot Encoding
    - Label Encoding
    - Ordinal Encoding
    - Binary Encoding
    - Target Encoding
    - Frequency Encoding

- c. Outlier Detection
    - Z-Score
    - IQR (Interquartile Range)
    - Modified Z-Score
    - DBScan Clustering
    - Isolation Forest
    - Local Outlier Factor(LOF)

- d. Feature Scaling
    - Standard Scaler(Z-score Normalized)
    - MinMax Scaler
    - Max Abs Scaler
    - Robust Scaler

- e. Data Transformation
    - Log Transform
    - Box-Cox Transform
    - Yeo-Johnson Transform
    - Quantile Transform

**2. Feature Construction:**

- Date Features (Year, Month, Day, Hour, Minute, Second)
- Time Series Decomposition (Trend, Seasonality, Residuals)
- Rolling Window Statistics (Mean, Median, Std Dev)
- Lag Features
- Aggregation Features
- Domain Specific Features

**3. Feature Selection:**

- a. Filter Methods
    - Correlation-based Selection
    - Chi-Square Test
    - ANOVA F-Test
    - Mutual Information

- b. Wrapper Methods
    - Recursive Feature Elimination (RFE)
    - Forward/Backward Selection

- c. Embedded Methods
    - Lasso Regularization
    - Ridge Regularization
    - Random Forest Importance

- d. Dimensionality Reduction Techniques
    - Principal Component Analysis (PCA)
    - Linear Discriminant Analysis (LDA)
    - t-SNE

**4. Feature Extraction:**

- Principal Component Analysis (PCA)
- Linear Discriminant Analysis (LDA)
- Kernel PCA
- t-Distributed Stochastic Neighbor Embedding (t-SNE)
- Uniform Manifold Approximation and Projection (UMAP)
-  Independent Component Analysis (ICA)
- Non-negative Matrix Factorization (NMF)
- Singular Value Decomposition (SVD)
______________________________________________________________________

**1. Feature Transformation**
- a. Missing Value Imputation

*These snippets assume you're working with a single column. For multiple columns, you might need to adjust the code accordingly.*

*When fitting missing value we can check for the distribution wether the distribution remains the same or not.*

*For categorical variables, you might want to use mode imputation or create a new category for missing values.*

*Apply a GridSearchCV for different params of imputation to find out which imputation is giving the best result.*

In [20]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer  
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

df = pd.read_csv("./data/train.csv")
df['column'] = df['Age']

# Simple Imputation
# Mean Imputation: Replaces missing values with the mean of the column
df['column'].fillna(df['column'].mean(), inplace=True)

# Median Imputation: Replaces missing values with the median of the column
df['column'].fillna(df['column'].median(), inplace=True)

# Mode Imputation: Replaces missing values with the most frequent value (mode) of the column
df['column'].fillna(df['column'].mode()[0], inplace=True)

# Forward Fill: Replaces missing values with the previous value in the column
df['column'].fillna(method='ffill', inplace=True)

# Backward Fill: Replaces missing values with the next value in the column
df['column'].fillna(method='bfill', inplace=True)

# Interpolation Methods
# Linear Interpolation: Estimates missing values using a straight line between known values
df['column'].interpolate(method='linear', inplace=True)

# Polynomial Interpolation: Estimates missing values using a polynomial function (curved line) of specified order
df['column'].interpolate(method='polynomial', order=2, inplace=True)

# Spline Interpolation: Estimates missing values using a piecewise polynomial function for a smooth curve
df['column'].interpolate(method='spline', order=2, inplace=True)

# K-Nearest Neighbors (KNN) Imputation: Replaces missing values using the mean value of the nearest neighbors
imputer = KNNImputer(n_neighbors=5)
df['column'] = imputer.fit_transform(df[['column']])

# Predictive Modeling (using IterativeImputer, which uses regression): 
# Estimates missing values using predictive modeling with iterative regression
imputer = IterativeImputer(random_state=0)
df['column'] = imputer.fit_transform(df[['column']])

# Using SimpleImputer for basic strategies
# Replaces missing values using specified strategy ('mean', 'median', 'most_frequent', 'constant')
imputer = SimpleImputer(strategy='mean')  # or 'median', 'most_frequent', 'constant'
df['column'] = imputer.fit_transform(df[['column']])

# Multiple Imputation (using IterativeImputer multiple times): Creates multiple imputations and averages the results
n_imputations = 5
imputed_data = []
for i in range(n_imputations):
    imputer = IterativeImputer(random_state=i)
    imputed_data.append(imputer.fit_transform(df))

# Average the imputations
df_imputed = pd.DataFrame(np.mean(imputed_data, axis=0), columns=df.columns)

- b. Handling Categorical Features

In [1]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from sklearn.preprocessing import KBinsDiscretizer

# One-Hot Encoding: Converts categories to binary vectors
onehot = OneHotEncoder(sparse=False)
onehot_encoded = onehot.fit_transform(df[['categorical_column']])
onehot_df = pd.DataFrame(onehot_encoded, columns=onehot.get_feature_names(['categorical_column']))

# Label Encoding: Assigns unique integer to each category
le = LabelEncoder()
df['label_encoded'] = le.fit_transform(df['categorical_column'])

# Ordinal Encoding: Assigns integers to categories with order
oe = OrdinalEncoder()
df['ordinal_encoded'] = oe.fit_transform(df[['categorical_column']])

# Target Encoding: Replaces categories with mean of the target variable
target_mean = df.groupby('categorical_column')['target'].mean()
df['target_encoded'] = df['categorical_column'].map(target_mean)

# Frequency Encoding: Replaces categories with their frequency
freq_encoding = df['categorical_column'].value_counts(normalize=True)
df['freq_encoded'] = df['categorical_column'].map(freq_encoding)

# K-Bins Discretization: Replaces categories with bins and places its count
kbin_encoded = KBinsDiscretizer(n_bins=15,encode='ordinal',strategy='quantile')
df['kbin_encoded'] = kbin_encoded.fit_transform(df[['kbin_encoded']])


16

- c. Outlier Detection

In [None]:
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import DBSCAN

# Z-Score: Detects outliers based on standard deviations from the mean
z_scores = np.abs((df['column'] - df['column'].mean()) / df['column'].std())
df['is_outlier_zscore'] = z_scores > 3

# IQR: Detects outliers based on the interquartile range
Q1 = df['column'].quantile(0.25)
Q3 = df['column'].quantile(0.75)
IQR = Q3 - Q1
df['is_outlier_iqr'] = (df['column'] < (Q1 - 1.5 * IQR)) | (df['column'] > (Q3 + 1.5 * IQR))

# Isolation Forest: Detects outliers using an ensemble of trees
iso_forest = IsolationForest(contamination=0.1)
df['is_outlier_iforest'] = iso_forest.fit_predict(df[['column']])

# Local Outlier Factor: Detects outliers by comparing the local density of a point to its neighbors
lof = LocalOutlierFactor()
df['is_outlier_lof'] = lof.fit_predict(df[['column']])

# DBScan Clustering: Detects outliers as points that do not belong to any cluster
dbscan = DBSCAN(eps=3, min_samples=2)
df['is_outlier_dbscan'] = dbscan.fit_predict(df[['column']])
df['is_outlier_dbscan'] = df['is_outlier_dbscan'] == -1

- d. Feature Scaling

In [None]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler

# Standard Scaler: Scales features by removing the mean and scaling to unit variance
scaler = StandardScaler()
df['scaled_standard'] = scaler.fit_transform(df[['column']])

# MinMax Scaler: Scales features to a given range (default is 0 to 1)
scaler = MinMaxScaler()
df['scaled_minmax'] = scaler.fit_transform(df[['column']])

# Robust Scaler: Scales features using statistics that are robust to outliers (uses median and IQR)
scaler = RobustScaler()
df['scaled_robust'] = scaler.fit_transform(df[['column']])

# MaxAbs Scaler: Scales features by their maximum absolute value, preserving the sign
scaler = MaxAbsScaler()
df['scaled_maxabs'] = scaler.fit_transform(df[['column']])

- e. Data Transformation

In [None]:
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer
X_train,X_test=[],[]
# Log Transform: Applies log transformation to reduce skewness | Best with outliers
df['log_transform'] = np.log1p(df['column'])

# Box-Cox Transform: Transforms data to make it more normally distributed | Sensitive to high values
df['boxcox_transform'], _ = stats.boxcox(df['column'])
pt = PowerTransformer(method='box-cox') # Default method: Yeo-Johnson
X_train_transformed = pt.fit_transform(X_train+0.000001)
X_test_transformed = pt.transform(X_test+0.000001)
pd.DataFrame({'cols':X_train.columns,'box_cox_lambdas':pt.lambdas_})

# Yeo-Johnson Transform: Transforms data to make it more normally distributed (handles zero and negative values) | Better than Box-Cox
df['yeojohnson_transform'], _ = stats.yeojohnson(df['column'])

# Quantile Transform: Transforms data to follow a uniform or normal distribution | Worst with Outliers
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(output_distribution='normal')
df['quantile_transform'] = qt.fit_transform(df[['column']])

**2. Feature Construction**

In [None]:
import pandas as pd
import numpy as np

# Date Features: Extract year, month, day, and day of the week from date column
df['date'] = pd.to_datetime(df['date_column'])
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day'] = df['date'].dt.day
df['day_of_week'] = df['date'].dt.dayofweek

# Time Series Decomposition: Decompose time series into trend, seasonal, and residual components
from statsmodels.tsa.seasonal import seasonal_decompose
result = seasonal_decompose(df['time_series_column'], model='additive', period=365) # per year 
df['trend'] = result.trend
df['seasonal'] = result.seasonal
df['residual'] = result.resid

# Rolling Window Statistics: Calculate rolling mean and standard deviation with a window of 7 days
df['rolling_mean'] = df['column'].rolling(window=7).mean()
df['rolling_std'] = df['column'].rolling(window=7).std()

# Lag Features: Create lag features for previous periods
df['lag_1'] = df['column'].shift(1)
df['lag_7'] = df['column'].shift(7)

# Aggregation Features: Compute mean and sum based on categories
df['mean_by_category'] = df.groupby('category')['value'].transform('mean')
df['sum_by_category'] = df.groupby('category')['value'].transform('sum')

**3. Feature Selection**

In [55]:
from sklearn.feature_selection import SelectKBest, chi2, f_regression, mutual_info_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import Lasso, Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer, KNNImputer  

df = pd.read_csv("./data/train.csv")
# Convert categorical columns to one-hot encoded format
df_encoded = pd.get_dummies(df[['Sex', 'Embarked']])
# Join the encoded columns back to the original DataFrame
df = df.join(df_encoded).drop(['Sex', 'Embarked'], axis=1)
predicted = 'Survived'
df.drop(['Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

# K-Nearest Neighbors (KNN) Imputation: Replaces missing values using the mean value of the nearest neighbors
imputer = KNNImputer(n_neighbors=5)
df['Age'] = imputer.fit_transform(df[['Age']])
X = df.drop(predicted,axis=1)
y = df[predicted]

# Filter Methods
# Correlation-based Selection: Select features with high absolute correlation with the target
# Works well for continuous features to identify those strongly correlated with the target
correlation = df.corr(numeric_only=True)['Survived'].abs().sort_values(ascending=False)
selected_features_corr = correlation[correlation > 0.1].index.tolist()

# Chi-Square Test: Select features based on chi-square statistical test
# Suitable for categorical features; it assesses independence between features and target
selector = SelectKBest(chi2, k=10)
X_new = selector.fit_transform(X, y)
selected_features_chi2 = X.columns[selector.get_support()]

# ANOVA F-Test: Select features based on F-statistic from ANOVA test
# Used for continuous features in regression tasks; it measures the relationship between features and target
selector = SelectKBest(f_regression, k=10)
X_new_f = selector.fit_transform(X, y)
selected_features_f = X.columns[selector.get_support()]

# Mutual Information: Select features based on mutual information between features and target
# Captures non-linear relationships between features and target; works with both categorical and continuous features
selector = SelectKBest(mutual_info_regression, k=10)
X_new_mi = selector.fit_transform(X, y)
selected_features_mi = X.columns[selector.get_support()]

# Wrapper Methods
# Recursive Feature Elimination (RFE): Select features using a model to recursively eliminate least important features
# Uses a model to recursively select features based on performance, suitable for any feature type
rfe = RFE(estimator=RandomForestRegressor(), n_features_to_select=10)
X_new_rfe = rfe.fit_transform(X, y)
selected_features_rfe = X.columns[rfe.support_]

# Embedded Methods
# Lasso: Select features based on Lasso regression (features with non-zero coefficients)
# Regularization technique that performs feature selection by shrinking some coefficients to zero; works well with continuous features
lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
selected_features_lasso = X.columns[lasso.coef_ != 0]

# Random Forest Importance: Select features based on feature importance from Random Forest
# Provides feature importance scores from a Random Forest model; useful for both continuous and categorical features
rf = RandomForestRegressor()
rf.fit(X, y)
importances = pd.DataFrame({'feature': X.columns, 'importance': rf.feature_importances_})
selected_features_rf = importances.nlargest(2, 'importance')['feature']  # Adjust n as needed


**4. Feature Extraction**

In [None]:
from sklearn.decomposition import PCA, KernelPCA, NMF
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.manifold import TSNE

# Principal Component Analysis (PCA)
# Reduces dimensionality by projecting data onto the directions of maximum variance.
pca = PCA(n_components=10)
X_pca = pca.fit_transform(X)

# Linear Discriminant Analysis (LDA)
# Reduces dimensionality by maximizing the separation between multiple classes.
lda = LinearDiscriminantAnalysis(n_components=10)
X_lda = lda.fit_transform(X, y)

# Kernel PCA
# Extends PCA to non-linear dimensionality reduction using kernel methods.
kpca = KernelPCA(n_components=10, kernel='rbf')
X_kpca = kpca.fit_transform(X)

# t-SNE (t-Distributed Stochastic Neighbor Embedding)
# Reduces dimensionality while preserving local structure, typically used for visualization.
tsne = TSNE(n_components=2)
X_tsne = tsne.fit_transform(X)

# Non-negative Matrix Factorization (NMF)
# Factorizes the matrix into non-negative matrices, useful for extracting parts-based representations.
nmf = NMF(n_components=10)
X_nmf = nmf.fit_transform(X)

______________________________________________________________________________

## Pipelines

In [107]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, PowerTransformer
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from faker import Faker
import warnings
warnings.filterwarnings("ignore")

# Initialize Faker for generating dummy data
fake = Faker()

# Create dummy data
data = {
    'age': np.random.randint(18, 70, size=10),  # Random ages between 18 and 70
    'income': np.random.randint(20000, 100000, size=10),  # Random incomes between $20,000 and $100,000
    'credit_score': np.random.randint(300, 850, size=10),  # Random credit scores between 300 and 850
    'gender': np.random.choice(['Male', 'Female'], size=10),  # Random genders
    'occupation': np.random.choice(['Engineer', 'Doctor', 'Artist', 'Teacher', 'Nurse'], size=10),  # Random occupations
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], size=10),  # Random cities
    'transaction_date': [fake.date_this_decade() for _ in range(10)],  # Random dates within this decade
    'description': [fake.sentence() for _ in range(10)],  # Random text descriptions
    'target': np.random.choice([0, 1], size=10)  # Random binary target values
}
# Create DataFrame
df = pd.DataFrame(data)
print(df.shape)
X = df.drop('target', axis=1)
y = df['target']
# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['gender', 'occupation', 'city']
date_features = ['transaction_date']
text_features = ['description']

# Custom transformer for Isolation Forest outlier removal
class IsolationForestOutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, contamination=0.1):
        self.contamination = contamination
        self.isolation_forest = IsolationForest(contamination=self.contamination, random_state=0)
    
    def fit(self, X, y=None):
        self.isolation_forest.fit(X)
        return self
    
    def transform(self, X):
        outliers = self.isolation_forest.predict(X) == -1
        return X[~outliers]

# 1. Feature Transformation Pipeline
feature_transformation = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')), # Imputer for missing values
        # ('outlier_remover', IsolationForestOutlierRemover()), # Outlier Removal, can use crafted classes as well
        ('yeo_johnson', PowerTransformer(method='yeo-johnson')), # Transformation
        ('scaler', MinMaxScaler())  # Scaling 
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_features)
])

# 2. Feature Construction Pipeline
class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['year'] = pd.to_datetime(X['transaction_date']).dt.year
        X['month'] = pd.to_datetime(X['transaction_date']).dt.month
        X['day_of_week'] = pd.to_datetime(X['transaction_date']).dt.dayofweek
        return X[['year', 'month', 'day_of_week']]

class TextFeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['word_count'] = X['description'].str.split().str.len()
        X['char_count'] = X['description'].str.len()
        return X[['word_count', 'char_count']]

feature_construction = ColumnTransformer([
    ('date', DateFeatureExtractor(), date_features),
    ('text', TextFeatureExtractor(), text_features)
])

# 3. Feature Selection Pipeline
feature_selection = ColumnTransformer([
    ('num_select_best', Pipeline([
        ('select_best', SelectKBest(f_regression, k=5))
    ]), numeric_features),
    ('cat_select_best', Pipeline([
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ('select_best', SelectKBest(chi2, k=5))
    ]), categorical_features),
    
])

# 4. Feature Extraction Pipeline
feature_extraction = ColumnTransformer([
    ('num_pca', Pipeline([
        ('imputer', SimpleImputer(strategy='median')),  # Impute missing values
        ('scaler', StandardScaler()),  # Scale the features
        ('pca', PCA(n_components=2))  # Apply PCA
    ]), numeric_features),
    ('cat_pca', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ('pca', PCA(n_components=2))  # Apply PCA
    ]), categorical_features)
])

# Combine all pipelines
final_pipeline = FeatureUnion([
    ('transformation', feature_transformation),
    ('construction', feature_construction),
    ('selection', feature_selection),
    ('extraction', feature_extraction)
])

full_pipeline = Pipeline([
    ('features', final_pipeline),
    ('classifier', LogisticRegression())
])

# Prepare data for training
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit and predict
full_pipeline.fit(X_train, y_train)
predictions = full_pipeline.predict(X_test)

# Display predictions
print(predictions)

(10, 9)
[0 1]


In [111]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler, PowerTransformer
from sklearn.feature_selection import SelectKBest, f_regression, chi2
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from faker import Faker
import warnings
warnings.filterwarnings("ignore")

# Initialize Faker for generating dummy data
fake = Faker()

# Create dummy data
data = {
    'age': np.random.randint(18, 70, size=10),  # Random ages between 18 and 70
    'income': np.random.randint(20000, 100000, size=10),  # Random incomes between $20,000 and $100,000
    'credit_score': np.random.randint(300, 850, size=10),  # Random credit scores between 300 and 850
    'gender': np.random.choice(['Male', 'Female'], size=10),  # Random genders
    'occupation': np.random.choice(['Engineer', 'Doctor', 'Artist', 'Teacher', 'Nurse'], size=10),  # Random occupations
    'city': np.random.choice(['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'], size=10),  # Random cities
    'transaction_date': [fake.date_this_decade() for _ in range(10)],  # Random dates within this decade
    'description': [fake.sentence() for _ in range(10)],  # Random text descriptions
    'target': np.random.choice([0, 1], size=10)  # Random binary target values
}
# Create DataFrame
df = pd.DataFrame(data)
print(df.shape)
X = df.drop('target', axis=1)
y = df['target']
# Define column types
numeric_features = ['age', 'income', 'credit_score']
categorical_features = ['gender', 'occupation', 'city']
date_features = ['transaction_date']
text_features = ['description']

# Custom transformer for Isolation Forest outlier removal
class IsolationForestOutlierRemover(BaseEstimator, TransformerMixin):
    def __init__(self, contamination=0.1):
        self.contamination = contamination
        self.isolation_forest = IsolationForest(contamination=self.contamination, random_state=0)
    
    def fit(self, X, y=None):
        self.isolation_forest.fit(X)
        return self
    
    def transform(self, X):
        outliers = self.isolation_forest.predict(X) == -1
        return X[~outliers]

class DateFeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['year'] = pd.to_datetime(X['transaction_date']).dt.year
        X['month'] = pd.to_datetime(X['transaction_date']).dt.month
        X['day_of_week'] = pd.to_datetime(X['transaction_date']).dt.dayofweek
        return X[['year', 'month', 'day_of_week']]

class TextFeatureExtractor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        X['word_count'] = X['description'].str.split().str.len()
        X['char_count'] = X['description'].str.len()
        return X[['word_count', 'char_count']]


# 1. Feature Transformation Pipeline
feature_transformation = ColumnTransformer([
    ('num', Pipeline([
        ('imputer', SimpleImputer(strategy='median')), # Imputer for missing values
        # ('outlier_remover', IsolationForestOutlierRemover()), # Outlier Removal, can use crafted classes as well
        ('yeo_johnson', PowerTransformer(method='yeo-johnson')), # Transformation
        ('scaler', MinMaxScaler()),  # Scaling 
        ('select_best', SelectKBest(f_regression, k=3)),
        ('pca', PCA(n_components=2)) 
    ]), numeric_features),
    ('cat', Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ('select_best', SelectKBest(chi2, k=3)),
        ('pca', PCA(n_components=2)) 
    ]), categorical_features),
    ('date_ft',Pipeline([  
        ('date', DateFeatureExtractor()),
        # ('lda',LinearDiscriminantAnalysis(n_components=2)),
    ]),date_features ),
     ('text', TextFeatureExtractor(), text_features),
])

pipeline_new = Pipeline([
    ('features', feature_transformation),
    ('classifier', LogisticRegression())
])

# Prepare data for training
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Fit and predict
pipeline_new.fit(X_train, y_train)
predictions_new = pipeline_new.predict(X_test)

# Display predictions
print(predictions_new)

(10, 9)
[0 1]


In [113]:
# Apply Feature Transformation
# Check the output of each pipeline step
transformed_data = feature_transformation.fit_transform(X,y)
print("Data after Feature Transformation:")
print(transformed_data)
print("Shape:", transformed_data.shape)

# # Apply Feature Construction
# constructed_data = feature_construction.fit_transform(X)
# constructed_df = pd.DataFrame(constructed_data, columns=['year', 'month', 'day_of_week', 'word_count', 'char_count'])
# print("Data after Feature Construction:")
# print(constructed_df.head())
# print("Shape:", constructed_df.shape)

# # Apply Feature Selection
# selected_data = feature_selection.fit_transform(X, y)
# selected_df = pd.DataFrame(selected_data)
# print("Data after Feature Selection:")
# print(selected_df.head())
# print("Shape:", selected_df.shape)

# # Apply Feature Extraction
# extracted_data = feature_extraction.fit_transform(X)
# extracted_df = pd.DataFrame(extracted_data, columns=[f'component_{i}' for i in range(extracted_data.shape[1])])
# print("Data after Feature Extraction:")
# print(extracted_df.head())
# print("Shape:", extracted_df.shape)



Data after Feature Transformation:
[[ 1.41913746e-01  2.49718014e-01  6.49804158e-01  3.43092076e-01
   2.02000000e+03  9.00000000e+00  1.00000000e+00  3.00000000e+00
   1.80000000e+01]
 [-6.45828653e-01  4.13573284e-01  6.49804158e-01  3.43092076e-01
   2.02200000e+03  4.00000000e+00  4.00000000e+00  4.00000000e+00
   3.50000000e+01]
 [-5.89996829e-01 -3.28468791e-01 -6.60175386e-01  1.91251947e-01
   2.02400000e+03  1.00000000e+00  0.00000000e+00  7.00000000e+00
   3.90000000e+01]
 [ 6.55706104e-01  4.74995477e-01 -5.16052717e-02 -3.67737824e-01
   2.02200000e+03  3.00000000e+00  4.00000000e+00  3.00000000e+00
   2.10000000e+01]
 [ 1.84520834e-01 -5.04404370e-01 -5.16052717e-02 -3.67737824e-01
   2.02200000e+03  5.00000000e+00  3.00000000e+00  5.00000000e+00
   3.40000000e+01]
 [ 4.82128965e-01 -1.38310644e-01  6.49804158e-01  3.43092076e-01
   2.02300000e+03  5.00000000e+00  6.00000000e+00  5.00000000e+00
   3.00000000e+01]
 [-2.81788776e-01  1.80848545e-01 -5.16052717e-02 -3.677378

**Print your pipeline**

In [112]:
pipeline_new.fit(X_train, y_train) # type: ignore
# full_pipeline.fit(X_train, y_train)

**Check your pipeline**

In [116]:
full_pipeline.named_steps
# pipeline_new.named_steps

{'features': FeatureUnion(transformer_list=[('transformation',
                                 ColumnTransformer(transformers=[('num',
                                                                  Pipeline(steps=[('imputer',
                                                                                   SimpleImputer(strategy='median')),
                                                                                  ('yeo_johnson',
                                                                                   PowerTransformer()),
                                                                                  ('scaler',
                                                                                   MinMaxScaler())]),
                                                                  ['age',
                                                                   'income',
                                                                   'credit_score']),
                   

**Use Cross-Validation for better check**

In [117]:
from sklearn import set_config
set_config(display='diagram')

# cross validation using cross_val_score
from sklearn.model_selection import cross_val_score
# cross_val_score(full_pipeline, X_train, y_train, cv=5, scoring='accuracy').mean()
cross_val_score(pipeline_new, X_train, y_train, cv=5, scoring='accuracy').mean() # type: ignore

np.float64(0.7)

**Pickle your pipeline**

In [118]:
# export 
import pickle
pickle.dump(full_pipeline,open('./output/pipeline_feature.pkl','wb'))
pickle.dump(pipeline_new,open('./output/pipeline_feature2.pkl','wb')) # type: ignore

**Pipeline prediction on Titanic Dataset**

In [122]:
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, OrdinalEncoder
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.metrics import accuracy_score, classification_report

# Load the data
data = pd.read_csv('./data/train.csv')
X = data.drop('Survived', axis=1)
y = data['Survived']

# Define feature groups
numeric_features = ['Age', 'Fare', 'SibSp', 'Parch']
categorical_features = ['Pclass', 'Sex', 'Embarked']
ordinal_features = ['Pclass']
nominal_features = ['Sex', 'Embarked']

# Custom transformer for feature construction
class FeatureConstructor(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X = X.copy()
        # Extract title from name
        X['Title'] = X['Name'].str.extract(' ([A-Za-z]+)\.', expand=False)
        # Group rare titles
        rare_titles = ['Lady', 'Countess','Capt', 'Col','Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona']
        X['Title'] = X['Title'].replace(rare_titles, 'Rare')
        # Family size
        X['FamilySize'] = X['SibSp'] + X['Parch'] + 1
        # Is alone
        X['IsAlone'] = (X['FamilySize'] == 1).astype(int)
        # Fare per person
        X['FarePerPerson'] = X['Fare'] / X['FamilySize']
        # Age * Class
        X['Age*Class'] = X['Age'] * X['Pclass']
        return X

# Numeric pipeline
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical pipeline
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Ordinal pipeline
ordinal_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('ordinal', OrdinalEncoder())
])

# Preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, nominal_features),
        ('ord', ordinal_transformer, ordinal_features)
    ])

# Feature engineering pipeline
feature_engineering = Pipeline([
    ('constructor', FeatureConstructor()),
    ('preprocessor', preprocessor),
    ('selector', SelectKBest(f_classif, k=15))
])

# Create different classifiers
rf_classifier = RandomForestClassifier(random_state=42)
lr_classifier = LogisticRegression(random_state=42)
svm_classifier = SVC(random_state=42)

# Create the final pipeline with different models
final_pipeline = {
    'RandomForest': Pipeline([
        ('features', feature_engineering),
        ('classifier', rf_classifier)
    ]),
    'LogisticRegression': Pipeline([
        ('features', feature_engineering),
        ('classifier', lr_classifier)
    ]),
    'SVM': Pipeline([
        ('features', feature_engineering),
        ('classifier', svm_classifier)
    ])
}

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train and evaluate each model
for name, pipeline in final_pipeline.items():
    print(f"\nTraining {name}...")
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    print(f"{name} Accuracy: {accuracy_score(y_test, y_pred):.4f}")
    print(f"{name} Classification Report:")
    print(classification_report(y_test, y_pred))

# Grid search for best model and parameters
param_grid = {
    'RandomForest': {
        'classifier__n_estimators': [100, 200],
        'classifier__max_depth': [5, 10, None],
        'features__selector__k': [10, 15, 20]
    },
    'LogisticRegression': {
        'classifier__C': [0.1, 1, 10],
        'classifier__penalty': ['l1', 'l2'],
        'features__selector__k': [10, 15, 20]
    },
    'SVM': {
        'classifier__C': [0.1, 1, 10],
        'classifier__kernel': ['rbf', 'linear'],
        'features__selector__k': [10, 15, 20]
    }
}

best_score = 0
best_model = None
best_params = None

for name, pipeline in final_pipeline.items():
    print(f"\nPerforming GridSearchCV for {name}...")
    grid_search = GridSearchCV(pipeline, param_grid[name], cv=5, scoring='accuracy', n_jobs=-1)
    grid_search.fit(X_train, y_train)
    if grid_search.best_score_ > best_score:
        best_score = grid_search.best_score_
        best_model = name
        best_params = grid_search.best_params_
    print(f"Best {name} score: {grid_search.best_score_:.4f}")
    print(f"Best {name} params: {grid_search.best_params_}")

print(f"\nOverall best model: {best_model}")
print(f"Best score: {best_score:.4f}")
print(f"Best parameters: {best_params}")

# Train the best model on the entire training set
best_pipeline = final_pipeline[best_model].set_params(**best_params)
best_pipeline.fit(X_train, y_train)

# Evaluate on the test set
y_pred = best_pipeline.predict(X_test)
print("\nBest Model Performance on Test Set:")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print("Classification Report:")
print(classification_report(y_test, y_pred))


Training RandomForest...
RandomForest Accuracy: 0.8212
RandomForest Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.87      0.85       105
           1       0.80      0.76      0.78        74

    accuracy                           0.82       179
   macro avg       0.82      0.81      0.81       179
weighted avg       0.82      0.82      0.82       179


Training LogisticRegression...
LogisticRegression Accuracy: 0.8101
LogisticRegression Classification Report:
              precision    recall  f1-score   support

           0       0.83      0.86      0.84       105
           1       0.79      0.74      0.76        74

    accuracy                           0.81       179
   macro avg       0.81      0.80      0.80       179
weighted avg       0.81      0.81      0.81       179


Training SVM...
SVM Accuracy: 0.8156
SVM Classification Report:
              precision    recall  f1-score   support

           0       0.82 