# Week 3 Dataset Analysis

## Question 1: How many features remain after applying the following pipeline to the feature matrix?

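**Answer: 4 features remain after applying the pipeline.** V5 takes a single value ('NEGATIVE'), so its ordinal-encoded column has zero variance and VarianceThreshold drops it; the four standardized numeric columns each have variance 1 and are kept.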

The pipeline consists of:
1. SimpleImputer (strategy='mean') on Features 1-4 
2. StandardScaler on Features 1-4
3. OrdinalEncoder on Feature 5
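4. FeatureUnion to combine the outputs (implemented below with ColumnTransformer, which applies each branch to its own columns and concatenates the results)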
5. VarianceThreshold (threshold=0.1) for feature selection
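
The effect of the last step can be illustrated on stand-in data. The sketch below is not the assignment dataset: the random numeric columns, the constant 'NEGATIVE' column, and every variable name are assumptions made for illustration. It shows that standardized columns end up with variance 1, while a single-valued category encodes to a constant column that a 0.1 threshold removes.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in data: two numeric columns and one constant category
numeric = rng.normal(loc=50.0, scale=10.0, size=(100, 2))
category = np.full((100, 1), "NEGATIVE", dtype=object)

scaled = StandardScaler().fit_transform(numeric)    # each column gets variance 1
encoded = OrdinalEncoder().fit_transform(category)  # one category, so every row is 0.0

# Concatenate the two branches, then apply the variance filter
combined = np.hstack([scaled, encoded])
selector = VarianceThreshold(threshold=0.1)
reduced = selector.fit_transform(combined)

column_variances = combined.var(axis=0)
kept_mask = selector.get_support()
print(column_variances)
print(kept_mask)
print(reduced.shape)
```

Only columns whose variance exceeds the threshold survive, so the constant column is the one that disappears.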

In [None]:
# Import required libraries
import pandas as pd
import numpy as np
import sklearn
from sklearn.preprocessing import StandardScaler, OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import warnings
warnings.filterwarnings('ignore')

print(f"Using scikit-learn version: {sklearn.__version__}")

# Load the dataset
df = pd.read_csv('Week3_GA_dataset.csv')
print(f"Dataset shape: {df.shape}")
df.head()

Using scikit-learn version: 1.7.2
Dataset shape: (748, 6)

Column names: ['V1', 'V2', 'V3', 'V4', 'V5', 'Target']

Data types:
V1         object
V2         object
V3        float64
V4        float64
V5         object
Target     object
dtype: object

First few rows:


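|   | V1  | V2   | V3      | V4   | V5       | Target |
|---|-----|------|---------|------|----------|--------|
| 0 | 2.0 | 50.0 | 12500.0 | 98.0 | NEGATIVE | YES    |
| 1 | 0.0 | 13.0 | 3250.0  | 28.0 | NEGATIVE | YES    |
| 2 | ?   | ?    | 4000.0  | 35.0 | NEGATIVE | YES    |
| 3 | ?   | 20.0 | 5000.0  | 45.0 | NEGATIVE | YES    |
| 4 | 1.0 | 24.0 | 6000.0  | 77.0 | NEGATIVE | NO     |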


In [None]:
# Prepare the data according to the pipeline diagram
df_processed = df.copy()
df_processed = df_processed.replace('?', np.nan)

# Convert numeric columns to proper data types
numeric_columns = ['V1', 'V2', 'V3', 'V4']
for col in numeric_columns:
    df_processed[col] = pd.to_numeric(df_processed[col], errors='coerce')

# Separate features from target
feature_columns = ['V1', 'V2', 'V3', 'V4', 'V5']
X = df_processed[feature_columns].copy()
y = df_processed['Target']

print(f"Feature matrix shape: {X.shape}")
print(f"Missing values in features:")
print(X.isnull().sum())
print(f"Unique values in V5: {X['V5'].unique()}")

Feature matrix shape: (748, 5)
Target shape: (748,)

Feature columns: ['V1', 'V2', 'V3', 'V4', 'V5']

Missing values in features:
V1    5
V2    5
V3    0
V4    0
V5    0
dtype: int64

Unique values in V5: ['NEGATIVE']


In [None]:
# Implement the exact pipeline from the diagram
# Branch 1: SimpleImputer + StandardScaler for numeric features (V1-V4)
numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

# Branch 2: OrdinalEncoder for categorical feature (V5)
categorical_pipeline = Pipeline([
    ('ordinal', OrdinalEncoder())
])

# Create ColumnTransformer to handle different feature types
preprocessor = ColumnTransformer(
    transformers=[
        ('numeric', numeric_pipeline, ['V1', 'V2', 'V3', 'V4']),  # Features 1-4
        ('categorical', categorical_pipeline, ['V5'])  # Feature 5
    ]
)

# Add VarianceThreshold for feature selection
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('variance_selector', VarianceThreshold(threshold=0.1))
])

print("Pipeline created successfully!")

Pipeline created successfully!

Pipeline steps:
1. preprocessor: ColumnTransformer(transformers=[('numeric',
                                 Pipeline(steps=[('imputer', SimpleImputer()),
                                                 ('scaler', StandardScaler())]),
                                 ['V1', 'V2', 'V3', 'V4']),
                                ('categorical',
                                 Pipeline(steps=[('ordinal',
                                                  OrdinalEncoder())]),
                                 ['V5'])])
2. variance_selector: VarianceThreshold(threshold=0.1)


In [None]:
# Apply the pipeline to the feature matrix
print("Original feature matrix shape:", X.shape)

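# V5 has no missing values here (see the counts above), so this fill is only a
# safeguard against NaN reaching OrdinalEncoder. The result is assigned back
# instead of calling fillna(..., inplace=True) on a column selection, which
# pandas deprecates.
X_for_pipeline = X.copy()
most_frequent_v5 = X_for_pipeline['V5'].mode()[0]
X_for_pipeline['V5'] = X_for_pipeline['V5'].fillna(most_frequent_v5)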

# Fit and transform the data
X_transformed = full_pipeline.fit_transform(X_for_pipeline)

print(f"After applying the complete pipeline:")
print(f"Transformed feature matrix shape: {X_transformed.shape}")
print(f"Number of features remaining: {X_transformed.shape[1]}")

Original feature matrix shape: (748, 5)
Original features: ['V1', 'V2', 'V3', 'V4', 'V5']

After handling V5 missing values:
V5 unique values: ['NEGATIVE']
Missing values: 10

After applying the complete pipeline:
Transformed feature matrix shape: (748, 4)
Number of features remaining: 4


In [None]:
# Detailed pipeline analysis
print("=" * 60)
print("DETAILED PIPELINE ANALYSIS")
print("=" * 60)

# Step 1: Apply preprocessing only (without variance threshold)
preprocessor_only = Pipeline([('preprocessor', preprocessor)])
X_after_preprocessing = preprocessor_only.fit_transform(X_for_pipeline)
print(f"After preprocessing: {X_after_preprocessing.shape[1]} features")

# Step 2: Check variance of each feature
variances = np.var(X_after_preprocessing, axis=0)
print(f"\nVariance of each feature:")
for i, var in enumerate(variances):
    print(f"   Feature {i}: {var:.6f}")

# Step 3: Apply variance threshold
variance_selector = VarianceThreshold(threshold=0.1)
X_after_variance = variance_selector.fit_transform(X_after_preprocessing)
print(f"\nAfter VarianceThreshold: {X_after_variance.shape[1]} features")

# Show which features were selected
selected_features = variance_selector.get_support()
print(f"\nFeature selection results:")
for i, (selected, var) in enumerate(zip(selected_features, variances)):
    status = "KEPT" if selected else "REMOVED"
    print(f"   Feature {i}: {status} (variance: {var:.6f})")

print("\n" + "=" * 60)
print(f"FINAL ANSWER: {X_after_variance.shape[1]} features remain")
print("=" * 60)

DETAILED PIPELINE ANALYSIS

1. After preprocessing (imputation + scaling + encoding):
   Shape: (748, 5)
   Features: 5

2. Variance of each feature after preprocessing:
   Feature 0: 1.000000
   Feature 1: 1.000000
   Feature 2: 1.000000
   Feature 3: 1.000000
   Feature 4: 0.000000

3. After VarianceThreshold (threshold=0.1):
   Shape: (748, 4)
   Features remaining: 4

4. Feature selection results:
   Feature 0: KEPT (variance: 1.000000)
   Feature 1: KEPT (variance: 1.000000)
   Feature 2: KEPT (variance: 1.000000)
   Feature 3: KEPT (variance: 1.000000)
   Feature 4: REMOVED (variance: 0.000000)

FINAL ANSWER: 4 features remain after applying the pipeline


## Question 2: What are the two most important features computed by RFE?

**Answer: V1 and V3**

**Instructions:** Preprocess the data using the pipeline shown in the diagram. Use LogisticRegression (with default parameters) as the estimator. Encode the target variable via ordinal encoding.

In [11]:
# Import additional libraries for RFE and target encoding
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OrdinalEncoder  # same class used for V5; no alias needed

# Encode target variable using ordinal encoding
target_encoder = OrdinalEncoder()
y_encoded = target_encoder.fit_transform(y.values.reshape(-1, 1)).ravel()

print("Target encoding:")
print(f"Original target values: {y.unique()}")
print(f"Encoded target values: {np.unique(y_encoded)}")
print(f"Encoding mapping: {dict(zip(y.unique(), target_encoder.transform(y.unique().reshape(-1, 1)).ravel()))}")

Target encoding:
Original target values: ['YES' 'NO']
Encoded target values: [0. 1.]
Encoding mapping: {'YES': np.float64(1.0), 'NO': np.float64(0.0)}
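
The mapping above follows from how OrdinalEncoder orders categories: by default it sorts the distinct labels, and each label's code is its position in that sorted list. The sketch below reproduces this on a four-element stand-in array (the array and variable names are illustrative, not taken from the dataset) and reads the mapping from the fitted encoder's `categories_` attribute.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical stand-in labels; OrdinalEncoder expects a 2D input
labels = np.array(["YES", "NO", "YES", "NO"], dtype=object).reshape(-1, 1)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(labels).ravel()

# categories_[0] lists the labels in the order of their integer codes
mapping = {label: code for code, label in enumerate(encoder.categories_[0])}
print(mapping)
print(codes)
```

Because 'NO' sorts before 'YES', the positive class ends up as 1.0, which matches the output above.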


In [12]:
# Apply the preprocessing pipeline (without VarianceThreshold for RFE)
# RFE will do its own feature selection
preprocessing_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])

# Transform the data using the preprocessing pipeline
X_preprocessed = preprocessing_pipeline.fit_transform(X_for_pipeline)

print(f"Shape after preprocessing: {X_preprocessed.shape}")
print(f"Features available for RFE: {X_preprocessed.shape[1]}")

# Create LogisticRegression estimator with default parameters
estimator = LogisticRegression()

# Create RFE to select 2 most important features
rfe = RFE(estimator=estimator, n_features_to_select=2)

# Fit RFE on the preprocessed data
X_rfe = rfe.fit_transform(X_preprocessed, y_encoded)

print(f"\nShape after RFE: {X_rfe.shape}")
print(f"Selected {X_rfe.shape[1]} features")

Shape after preprocessing: (748, 5)
Features available for RFE: 5

Shape after RFE: (748, 2)
Selected 2 features


In [13]:
# Analyze RFE results
print("=" * 60)
print("RFE FEATURE SELECTION RESULTS")
print("=" * 60)

# Get feature support (which features were selected)
feature_support = rfe.support_
feature_ranking = rfe.ranking_

# Map back to original feature names
# After preprocessing: [V1, V2, V3, V4, V5_encoded]
feature_names_after_preprocessing = ['V1', 'V2', 'V3', 'V4', 'V5_encoded']

print("Feature selection results:")
selected_features = []
for i, (name, selected, rank) in enumerate(zip(feature_names_after_preprocessing, feature_support, feature_ranking)):
    status = "SELECTED" if selected else f"RANK {rank}"
    print(f"Feature {i} ({name}): {status}")
    if selected:
        selected_features.append(name)

print("\n" + "=" * 60)
print("TWO MOST IMPORTANT FEATURES BY RFE:")
for i, feature in enumerate(selected_features, 1):
    print(f"{i}. {feature}")
print("=" * 60)

# Verify the results
print(f"\nVerification:")
print(f"Number of features selected: {len(selected_features)}")
print(f"Selected feature indices: {np.where(feature_support)[0]}")
print(f"Feature rankings: {feature_ranking}")

RFE FEATURE SELECTION RESULTS
Feature selection results:
Feature 0 (V1): SELECTED
Feature 1 (V2): RANK 3
Feature 2 (V3): SELECTED
Feature 3 (V4): RANK 2
Feature 4 (V5_encoded): RANK 4

TWO MOST IMPORTANT FEATURES BY RFE:
1. V1
2. V3

Verification:
Number of features selected: 2
Selected feature indices: [0 2]
Feature rankings: [1 3 1 2 4]


## Question 3: What are the indices of two most important features computed by SFS (forward)?

**Answer: Indices [1, 3] (Features V2 and V4)**

**Instructions:** Preprocess the data using the pipeline shown in the diagram. Use LogisticRegression (with default parameters) as the estimator. Encode the target variable via ordinal encoding.

In [14]:
# Import SequentialFeatureSelector for SFS
from sklearn.feature_selection import SequentialFeatureSelector

# Use the same preprocessed data from the previous question
print(f"Using preprocessed data with shape: {X_preprocessed.shape}")
print(f"Target encoding already done: {np.unique(y_encoded)}")

# Create LogisticRegression estimator with default parameters
estimator_sfs = LogisticRegression()

# Create Sequential Feature Selector (forward direction)
# n_features_to_select=2 to get the two most important features
sfs = SequentialFeatureSelector(
    estimator=estimator_sfs, 
    n_features_to_select=2, 
    direction='forward'
)

# Fit SFS on the preprocessed data
X_sfs = sfs.fit_transform(X_preprocessed, y_encoded)

print(f"\nShape after SFS: {X_sfs.shape}")
print(f"Selected {X_sfs.shape[1]} features")

Using preprocessed data with shape: (748, 5)
Target encoding already done: [0. 1.]

Shape after SFS: (748, 2)
Selected 2 features


In [15]:
# Analyze SFS results
print("=" * 60)
print("SFS (FORWARD) FEATURE SELECTION RESULTS")
print("=" * 60)

# Get feature support (which features were selected)
feature_support_sfs = sfs.get_support()
selected_indices = np.where(feature_support_sfs)[0]

# Map back to original feature names
feature_names_after_preprocessing = ['V1', 'V2', 'V3', 'V4', 'V5_encoded']

print("Feature selection results:")
selected_features_sfs = []
selected_feature_indices = []

for i, (name, selected) in enumerate(zip(feature_names_after_preprocessing, feature_support_sfs)):
    status = "SELECTED" if selected else "NOT SELECTED"
    print(f"Index {i} ({name}): {status}")
    if selected:
        selected_features_sfs.append(name)
        selected_feature_indices.append(i)

print("\n" + "=" * 60)
print("TWO MOST IMPORTANT FEATURES BY SFS (FORWARD):")
for idx, (feature_idx, feature_name) in enumerate(zip(selected_feature_indices, selected_features_sfs)):
    print(f"{idx+1}. Index {feature_idx} ({feature_name})")

print(f"\nANSWER - Selected feature indices: {selected_feature_indices}")
print("=" * 60)

# Additional verification
print(f"\nVerification:")
print(f"Number of features selected: {len(selected_feature_indices)}")
print(f"Selected indices from get_support(): {selected_indices}")
print(f"Feature support array: {feature_support_sfs}")

# Show which features were selected
print("\nFeature selection mapping:")
for i, name in enumerate(['V1', 'V2', 'V3', 'V4', 'V5']):
    print(f"- Feature {i} ({name}): {'✓' if i in selected_feature_indices else '✗'}")

SFS (FORWARD) FEATURE SELECTION RESULTS
Feature selection results:
Index 0 (V1): NOT SELECTED
Index 1 (V2): SELECTED
Index 2 (V3): NOT SELECTED
Index 3 (V4): SELECTED
Index 4 (V5_encoded): NOT SELECTED

TWO MOST IMPORTANT FEATURES BY SFS (FORWARD):
1. Index 1 (V2)
2. Index 3 (V4)

ANSWER - Selected feature indices: [1, 3]

Verification:
Number of features selected: 2
Selected indices from get_support(): [1 3]
Feature support array: [False  True False  True False]

Feature selection mapping:
- Feature 0 (V1): ✗
- Feature 1 (V2): ✓
- Feature 2 (V3): ✗
- Feature 3 (V4): ✓
- Feature 4 (V5): ✗


## Question 4: What are the indices of two most important features computed by SFS (backward)?

**Answer: Indices [2, 3] (Features V3 and V4)**

**Instructions:** Preprocess the data using the pipeline shown in the diagram. Use LogisticRegression (with default parameters) as the estimator. Encode the target variable via ordinal encoding.

In [16]:
# Use the same preprocessed data from previous questions
print(f"Using preprocessed data with shape: {X_preprocessed.shape}")
print(f"Target encoding already done: {np.unique(y_encoded)}")

# Create LogisticRegression estimator with default parameters
estimator_sfs_backward = LogisticRegression()

# Create Sequential Feature Selector (backward direction)
# n_features_to_select=2 to get the two most important features
sfs_backward = SequentialFeatureSelector(
    estimator=estimator_sfs_backward, 
    n_features_to_select=2, 
    direction='backward'
)

# Fit SFS backward on the preprocessed data
X_sfs_backward = sfs_backward.fit_transform(X_preprocessed, y_encoded)

print(f"\nShape after SFS (backward): {X_sfs_backward.shape}")
print(f"Selected {X_sfs_backward.shape[1]} features")

Using preprocessed data with shape: (748, 5)
Target encoding already done: [0. 1.]

Shape after SFS (backward): (748, 2)
Selected 2 features


In [17]:
# Analyze SFS (backward) results
print("=" * 60)
print("SFS (BACKWARD) FEATURE SELECTION RESULTS")
print("=" * 60)

# Get feature support (which features were selected)
feature_support_sfs_backward = sfs_backward.get_support()
selected_indices_backward = np.where(feature_support_sfs_backward)[0]

# Map back to original feature names
feature_names_after_preprocessing = ['V1', 'V2', 'V3', 'V4', 'V5_encoded']

print("Feature selection results:")
selected_features_sfs_backward = []
selected_feature_indices_backward = []

for i, (name, selected) in enumerate(zip(feature_names_after_preprocessing, feature_support_sfs_backward)):
    status = "SELECTED" if selected else "NOT SELECTED"
    print(f"Index {i} ({name}): {status}")
    if selected:
        selected_features_sfs_backward.append(name)
        selected_feature_indices_backward.append(i)

print("\n" + "=" * 60)
print("TWO MOST IMPORTANT FEATURES BY SFS (BACKWARD):")
for idx, (feature_idx, feature_name) in enumerate(zip(selected_feature_indices_backward, selected_features_sfs_backward)):
    print(f"{idx+1}. Index {feature_idx} ({feature_name})")

print(f"\nANSWER - Selected feature indices: {selected_feature_indices_backward}")
print("=" * 60)

# Additional verification
print(f"\nVerification:")
print(f"Number of features selected: {len(selected_feature_indices_backward)}")
print(f"Selected indices from get_support(): {selected_indices_backward}")
print(f"Feature support array: {feature_support_sfs_backward}")

# Show which features were selected
print("\nFeature selection mapping:")
for i, name in enumerate(['V1', 'V2', 'V3', 'V4', 'V5']):
    print(f"- Feature {i} ({name}): {'✓' if i in selected_feature_indices_backward else '✗'}")

# Compare with forward SFS results
print(f"\nComparison with SFS Forward:")
print(f"SFS Forward selected indices: {selected_feature_indices}")
print(f"SFS Backward selected indices: {selected_feature_indices_backward}")
print(f"Same selection: {selected_feature_indices == selected_feature_indices_backward}")

SFS (BACKWARD) FEATURE SELECTION RESULTS
Feature selection results:
Index 0 (V1): NOT SELECTED
Index 1 (V2): NOT SELECTED
Index 2 (V3): SELECTED
Index 3 (V4): SELECTED
Index 4 (V5_encoded): NOT SELECTED

TWO MOST IMPORTANT FEATURES BY SFS (BACKWARD):
1. Index 2 (V3)
2. Index 3 (V4)

ANSWER - Selected feature indices: [2, 3]

Verification:
Number of features selected: 2
Selected indices from get_support(): [2 3]
Feature support array: [False False  True  True False]

Feature selection mapping:
- Feature 0 (V1): ✗
- Feature 1 (V2): ✗
- Feature 2 (V3): ✓
- Feature 3 (V4): ✓
- Feature 4 (V5): ✗

Comparison with SFS Forward:
SFS Forward selected indices: [1, 3]
SFS Backward selected indices: [2, 3]
Same selection: False
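
The forward and backward runs disagree because SequentialFeatureSelector is greedy in both directions: forward selection starts from no features and adds the one that most improves the cross-validated score, while backward selection starts from all features and removes the one whose loss hurts that score least. The two search paths need not meet at the same subset. The sketch below runs both directions on a synthetic problem; `make_classification`, its parameters, and the helper `select_indices` are assumptions for illustration, and `cv=5` is written out to make the default 5-fold scoring explicit.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

# Hypothetical stand-in problem with five features
X_demo, y_demo = make_classification(
    n_samples=300, n_features=5, n_informative=3, n_redundant=1, random_state=0
)

def select_indices(direction):
    """Return the indices of the two features kept for the given direction."""
    selector = SequentialFeatureSelector(
        LogisticRegression(), n_features_to_select=2, direction=direction, cv=5
    )
    selector.fit(X_demo, y_demo)
    return np.flatnonzero(selector.get_support()).tolist()

forward_indices = select_indices("forward")
backward_indices = select_indices("backward")
print("forward: ", forward_indices)
print("backward:", backward_indices)
print("same selection:", forward_indices == backward_indices)
```

Whether the two directions agree depends on the data, so a mismatch like the one above is not by itself a sign of an error.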
