# Scikit-learn Pre-processing functions

## Feature Scaling
Scikit-learn provides several transformers for feature scaling in `sklearn.preprocessing`, including `StandardScaler`, `MinMaxScaler`, `RobustScaler`, and `MaxAbsScaler`.

| Common Methods | Description |
|----------------|-------------|
|`fit(X, y=None)`| Compute scaling parameters from training data |
|`transform(X)` |Apply scaling transformation to data|
|`fit_transform(X, y=None)`| Fit and transform in one step|
|`inverse_transform(X)`| Reverse the scaling transformation|
|`get_params(deep=True)`| Get parameters for this estimator|
|`set_params(**params)`| Set parameters for this estimator |
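
The split between `fit` and `transform` matters most when there is held-out data: the scaling parameters are learned from the training rows and then reused unchanged on new rows. A minimal sketch, assuming made-up arrays `X_train` and `X_test` (illustrative names and values, not part of the examples in this notebook):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical training and test data (illustrative values only)
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[4.0, 40.0]])

scaler = StandardScaler()
scaler.fit(X_train)                         # learn mean_ and scale_ from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)    # reuse the training statistics on new data

# inverse_transform maps scaled values back to the original units
X_test_restored = scaler.inverse_transform(X_test_scaled)

print("Training means:", scaler.mean_)
print("Scaled test row:", X_test_scaled)
print("Restored test row:", X_test_restored)
```

Calling `fit_transform` on the test set instead would compute fresh statistics from the test rows, so the two sets would no longer be on the same scale.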

| Scaler | Fitted attributes |
|--------|-------------------|
|`StandardScaler`| <table><tr><td>mean_</td><td>Mean of each feature</td></tr><tr><td>scale_</td><td>Standard deviation of each feature</td></tr><tr><td>var_</td><td>Variance of each feature</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |
|`MinMaxScaler`| <table><tr><td>data_min_</td><td>Minimum value of each feature</td></tr><tr><td>data_max_</td><td>Maximum value of each feature</td></tr><tr><td>data_range_</td><td>Range (maximum minus minimum) of each feature</td></tr><tr><td>min_</td><td>Per-feature offset added after scaling, so that the minimum maps to the lower bound of feature_range</td></tr><tr><td>scale_</td><td>Per-feature scale factor</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |
|`RobustScaler`| <table><tr><td>center_</td><td>Median of each feature</td></tr><tr><td>scale_</td><td>Interquartile range of each feature (25th to 75th percentile by default)</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |
|`MaxAbsScaler`| <table><tr><td>max_abs_</td><td>Maximum absolute value of each feature</td></tr><tr><td>scale_</td><td>Per-feature scale factor (equal to max_abs_, with zeros replaced by 1)</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |


In [2]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler
import numpy as np

# Example data
data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print("Original Data:\n", data)

# Standardization
scaler_standard = StandardScaler()
data_standardized = scaler_standard.fit_transform(data)
print("Standardized Data:\n", data_standardized)

# Min-Max Scaling
scaler_minmax = MinMaxScaler()
data_minmax = scaler_minmax.fit_transform(data)
print("Min-Max Scaled Data:\n", data_minmax)

# Robust Scaling
scaler_robust = RobustScaler()
data_robust = scaler_robust.fit_transform(data)
print("Robust Scaled Data:\n", data_robust)

# MaxAbs Scaling
scaler_maxabs = MaxAbsScaler()
data_maxabs = scaler_maxabs.fit_transform(data)
print("MaxAbs Scaled Data:\n", data_maxabs)

Original Data:
 [[1 2 3]
 [4 5 6]
 [7 8 9]]
Standardized Data:
 [[-1.22474487 -1.22474487 -1.22474487]
 [ 0.          0.          0.        ]
 [ 1.22474487  1.22474487  1.22474487]]
Min-Max Scaled Data:
 [[0.  0.  0. ]
 [0.5 0.5 0.5]
 [1.  1.  1. ]]
Robust Scaled Data:
 [[-1. -1. -1.]
 [ 0.  0.  0.]
 [ 1.  1.  1.]]
MaxAbs Scaled Data:
 [[0.14285714 0.25       0.33333333]
 [0.57142857 0.625      0.66666667]
 [1.         1.         1.        ]]
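
The four outputs above can be reproduced by hand, which makes the underlying formulas explicit. The sketch below recomputes each scaling with plain NumPy on the same 3×3 array and compares the result to the corresponding scikit-learn transformer; the `manual_*` names are illustrative, and the formulas assume the default parameters of each scaler.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler, MaxAbsScaler

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)

# StandardScaler: subtract the column mean, divide by the population std (ddof=0)
manual_standard = (X - X.mean(axis=0)) / X.std(axis=0)

# MinMaxScaler: map each column onto [0, 1]
manual_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# RobustScaler: subtract the median, divide by the interquartile range
q25, q75 = np.percentile(X, [25, 75], axis=0)
manual_robust = (X - np.median(X, axis=0)) / (q75 - q25)

# MaxAbsScaler: divide by the largest absolute value in each column
manual_maxabs = X / np.abs(X).max(axis=0)

# Each comparison should print True if the manual formula matches the transformer
print(np.allclose(manual_standard, StandardScaler().fit_transform(X)))
print(np.allclose(manual_minmax, MinMaxScaler().fit_transform(X)))
print(np.allclose(manual_robust, RobustScaler().fit_transform(X)))
print(np.allclose(manual_maxabs, MaxAbsScaler().fit_transform(X)))
```

Note that `StandardScaler` uses the population standard deviation, which is why the corner values are ±1.2247 rather than ±1.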


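## Feature Encoding
Scikit-learn provides several encoders for categorical data, including `LabelEncoder`, `OrdinalEncoder`, and `OneHotEncoder`. `LabelEncoder` is intended for target labels (`y`), so its methods take a single 1-D array rather than a feature matrix `X`.
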
| Common Methods | Description |
|----------------|-------------|
|`fit(X, y=None)`| Compute encoding parameters from training data |
|`transform(X)` |Apply encoding transformation to data|
|`fit_transform(X, y=None)`| Fit and transform in one step|
|`inverse_transform(X)`| Reverse the encoding transformation|
|`get_params(deep=True)`| Get parameters for this estimator|
|`set_params(**params)`| Set parameters for this estimator |

| Encoder | Fitted attributes |
|---------|-------------------|
|`LabelEncoder`| <table><tr><td>classes_</td><td>Sorted unique classes found in the data</td></tr></table> |
|`OrdinalEncoder`| <table><tr><td>categories_</td><td>Categories for each feature</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |
|`OneHotEncoder`| <table><tr><td>categories_</td><td>Categories for each feature</td></tr><tr><td>drop_idx_</td><td>Index of the dropped category for each feature, or None if no category is dropped</td></tr><tr><td>n_features_in_</td><td>Number of features seen during fit</td></tr><tr><td>feature_names_in_</td><td>Names of features seen during fit</td></tr></table> |
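
As a small illustration of `classes_` and `inverse_transform`, the sketch below encodes a made-up list of target labels with `LabelEncoder`. The list `y` is hypothetical; the point is only that integer codes follow the sorted order of the unique labels.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical target labels (illustrative values only)
y = ['cat', 'dog', 'cat', 'bird']

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(encoder.classes_)                        # sorted unique labels
print(y_encoded)                               # integer code of each label
print(encoder.inverse_transform(y_encoded))    # back to the original labels
```

Because the codes follow alphabetical order, they carry no meaningful ranking; for ordered feature categories, `OrdinalEncoder` with an explicit `categories` list is the better fit.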





In [None]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder
import pandas as pd

# Example categorical data
categorical_data = pd.DataFrame([['red', 'small'], ['blue', 'large'], ['green', 'medium']])
print("Original Data:\n", categorical_data)

# LabelEncoder (designed for 1-D target labels, applied here to the first column)
label_encoder = LabelEncoder()
encoded_labels = label_encoder.fit_transform(categorical_data[0])
print("Encoded Labels:\n", encoded_labels)

# OrdinalEncoder: `categories` specifies the order of the categories for each column
ordinal_encoder = OrdinalEncoder(categories=[['red', 'blue', 'green'], ['small', 'medium', 'large']])
ordinal_encoded = ordinal_encoder.fit_transform(categorical_data)
print("Ordinal Encoded Data:\n", ordinal_encoded)

# OneHotEncoder
onehot_encoder_sparse = OneHotEncoder(sparse_output=True) # sparse_output=True returns a sparse matrix
onehot_encoded_sparse = onehot_encoder_sparse.fit_transform(categorical_data)
onehot_encoder_dense = OneHotEncoder(sparse_output=False) # sparse_output=False returns a dense matrix
onehot_encoded_dense = onehot_encoder_dense.fit_transform(categorical_data)
print("OneHot Encoded Sparse Data:\n", onehot_encoded_sparse)
print("OneHot Encoded Dense Data:\n", onehot_encoded_dense)

# Dummy Encoding
dummy_encoder = OneHotEncoder(drop='first', sparse_output=False) # drop='first' drops the first category to achieve dummy encoding
dummy_encoded = dummy_encoder.fit_transform(categorical_data[[0]])
print("Dummy Encoded Data:\n", dummy_encoded)

Original Data:
        0       1
0    red   small
1   blue   large
2  green  medium
Encoded Labels:
 [2 0 1]
Ordinal Encoded Data:
 [[0. 0.]
 [1. 2.]
 [2. 1.]]
OneHot Encoded Sparse Data:
   (0, 2)	1.0
  (0, 5)	1.0
  (1, 0)	1.0
  (1, 3)	1.0
  (2, 1)	1.0
  (2, 4)	1.0
OneHot Encoded Dense Data:
 [[0. 0. 1. 0. 0. 1.]
 [1. 0. 0. 1. 0. 0.]
 [0. 1. 0. 0. 1. 0.]]
Dummy Encoded Data:
 [[0. 1.]
 [0. 0.]
 [1. 0.]]


## Missing Value Imputation

Scikit-learn provides several methods for handling missing values through the `sklearn.impute` module.

| Common Methods | Description |
|----------------|-------------|
|`fit(X, y=None)`| Learn imputation strategy from training data |
|`transform(X)` |Apply imputation to data|
|`fit_transform(X, y=None)`| Fit and transform in one step|
|`get_params(deep=True)`| Get parameters for this estimator|
|`set_params(**params)`| Set parameters for this estimator |

  

| Imputer | Description | Key Parameters |
|---------|-------------|----------------|
|`SimpleImputer`| Basic imputation strategies | `strategy`: 'mean', 'median', 'most_frequent', 'constant'<br>`fill_value`: Value for constant strategy |
|`IterativeImputer`| Multivariate imputation that models each feature from the others (experimental; must be enabled via `sklearn.experimental`) | `estimator`: Model used for prediction (default: `BayesianRidge()`)<br>`max_iter`: Maximum number of imputation rounds<br>`random_state`: For reproducibility |
|`KNNImputer`| K-Nearest Neighbors imputation | `n_neighbors`: Number of neighbors<br>`weights`: 'uniform' or 'distance' |
|`MissingIndicator`| Creates binary indicators for missing values | `features`: 'missing-only' or 'all'<br>`sparse`: `True`, `False`, or 'auto' (follow the input format) |

| Attribute | Description | SimpleImputer | IterativeImputer | KNNImputer |
|-----------|-------------|---------------|------------------|------------|
|`statistics_`| Imputation values per feature | ✓ | ✗ | ✗ |
|`indicator_`| Fitted `MissingIndicator` (`None` unless `add_indicator=True`) | ✓ | ✓ | ✓ |
|`n_features_in_`| Number of features seen during fit | ✓ | ✓ | ✓ |
|`feature_names_in_`| Names of features seen during fit | ✓ | ✓ | ✓ |
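
To make `statistics_` and `indicator_` concrete, here is a minimal sketch using `SimpleImputer` with `add_indicator=True` on a tiny made-up array (`X` and its values are illustrative only). The imputer fills each gap with the column mean and appends one binary column per feature that had missing values during `fit`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

# add_indicator=True appends a binary "was missing" column for each affected feature
imputer = SimpleImputer(strategy='mean', add_indicator=True)
X_out = imputer.fit_transform(X)

print(imputer.statistics_)            # per-feature fill values (column means)
print(imputer.indicator_.features_)   # indices of features that had missing values
print(X_out)                          # imputed features followed by indicator columns
```

The first two output columns are the imputed features; the last two flag which entries were originally missing.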

In [5]:
import numpy as np
import pandas as pd

# Create sample data with missing values
np.random.seed(42)
data = np.array([
    [1, 2, 3, 4],
    [5, np.nan, 7, 8],
    [9, 10, np.nan, 12],
    [13, 14, 15, np.nan],
    [np.nan, 18, 19, 20],
    [21, 22, 23, 24]
])

df = pd.DataFrame(data, columns=['A', 'B', 'C', 'D'])
print("Original Data with Missing Values:")
print(df)

# 1. SimpleImputer - Different strategies
from sklearn.impute import SimpleImputer

print("\n=== SimpleImputer Examples ===")

# Mean imputation
imputer_mean = SimpleImputer(strategy='mean')
data_mean = imputer_mean.fit_transform(data)
print(f"\nMean Imputation:")
print(pd.DataFrame(data_mean, columns=['A', 'B', 'C', 'D']))
print(f"Imputation values: {imputer_mean.statistics_}")

# Median imputation
imputer_median = SimpleImputer(strategy='median')
data_median = imputer_median.fit_transform(data)
print(f"\nMedian Imputation:")
print(pd.DataFrame(data_median, columns=['A', 'B', 'C', 'D']))

# Constant value imputation
imputer_constant = SimpleImputer(strategy='constant', fill_value=999)
data_constant = imputer_constant.fit_transform(data)
print(f"\nConstant Imputation (999):")
print(pd.DataFrame(data_constant, columns=['A', 'B', 'C', 'D']))

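# 2. IterativeImputer - Multivariate imputation
# IterativeImputer is experimental: this import must run before importing it from sklearn.impute
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer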

print(f"\n=== IterativeImputer Example ===")
# Using BayesianRidge as default estimator
# You can also specify other estimators like DecisionTreeRegressor, RandomForestRegressor, etc.
imputer_iterative_default = IterativeImputer(random_state=42, max_iter=10)
data_iterative_default = imputer_iterative_default.fit_transform(data)
print(f"Default Iterative Imputation:")
print(pd.DataFrame(data_iterative_default, columns=['A', 'B', 'C', 'D']))
# Using a different estimator (e.g., DecisionTreeRegressor)
from sklearn.tree import DecisionTreeRegressor
imputer_iterative_tree = IterativeImputer(estimator=DecisionTreeRegressor(), random_state=42, max_iter=10)
data_iterative_tree = imputer_iterative_tree.fit_transform(data)
print(f"\nIterative Imputation with DecisionTreeRegressor:")
print(pd.DataFrame(data_iterative_tree, columns=['A', 'B', 'C', 'D']))

# 3. KNNImputer
from sklearn.impute import KNNImputer

print(f"\n=== KNNImputer Example ===")
imputer_knn = KNNImputer(n_neighbors=2)
data_knn = imputer_knn.fit_transform(data)
print(f"KNN Imputation (k=2):")
print(pd.DataFrame(data_knn, columns=['A', 'B', 'C', 'D']))

# 4. MissingIndicator
from sklearn.impute import MissingIndicator

print(f"\n=== MissingIndicator Example ===")
indicator = MissingIndicator()
missing_mask = indicator.fit_transform(data)
print(f"Missing Value Indicators:")
print(pd.DataFrame(missing_mask, columns=[f'missing_{col}' for col in ['A', 'B', 'C', 'D']]))
print(f"Features with missing values: {indicator.features_}")

Original Data with Missing Values:
      A     B     C     D
0   1.0   2.0   3.0   4.0
1   5.0   NaN   7.0   8.0
2   9.0  10.0   NaN  12.0
3  13.0  14.0  15.0   NaN
4   NaN  18.0  19.0  20.0
5  21.0  22.0  23.0  24.0

=== SimpleImputer Examples ===

Mean Imputation:
      A     B     C     D
0   1.0   2.0   3.0   4.0
1   5.0  13.2   7.0   8.0
2   9.0  10.0  13.4  12.0
3  13.0  14.0  15.0  13.6
4   9.8  18.0  19.0  20.0
5  21.0  22.0  23.0  24.0
Imputation values: [ 9.8 13.2 13.4 13.6]

Median Imputation:
      A     B     C     D
0   1.0   2.0   3.0   4.0
1   5.0  14.0   7.0   8.0
2   9.0  10.0  15.0  12.0
3  13.0  14.0  15.0  12.0
4   9.0  18.0  19.0  20.0
5  21.0  22.0  23.0  24.0

Constant Imputation (999):
       A      B      C      D
0    1.0    2.0    3.0    4.0
1    5.0  999.0    7.0    8.0
2    9.0   10.0  999.0   12.0
3   13.0   14.0   15.0  999.0
4  999.0   18.0   19.0   20.0
5   21.0   22.0   23.0   24.0

=== IterativeImputer Example ===
Default Iterative Imputation:
      



### Key Considerations for Missing Value Imputation
1. Choose a strategy based on data type:
   - **Numerical**: mean (roughly symmetric distributions), median (skewed data or outliers), KNN (correlated features)
   - **Categorical**: most frequent value or a constant placeholder
2. Consider the missing-data mechanism:
   - **MCAR (Missing Completely At Random)**: simple strategies usually work well
   - **MAR (Missing At Random)**: `IterativeImputer` or `KNNImputer` tend to work better because they use the other features
   - **MNAR (Missing Not At Random)**: domain-specific strategies are needed
3. Evaluate the impact:
   - Compare model performance with and without imputation (for example, against dropping incomplete rows)
   - Consider adding missing-value indicators as additional features
   - Be cautious when a feature has a high share of missing data (>50%)
4. Production considerations:
   - Fit imputers only on training data
   - Save the fitted imputer so the same statistics are applied at inference time
   - Handle missing patterns in production data that did not occur during training
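
The production points above can be sketched with a `Pipeline`, which keeps the imputation statistics tied to the training data and applies them unchanged to new rows. The arrays `X_train` and `X_test`, the step names, and the choice of median imputation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical train/test arrays (illustrative values only)
X_train = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan], [5.0, 30.0]])
X_test = np.array([[np.nan, 40.0], [2.0, np.nan]])

preprocess = Pipeline([
    # Median imputation plus binary missing-value indicator columns
    ('impute', SimpleImputer(strategy='median', add_indicator=True)),
    ('scale', StandardScaler()),
])

# Fit on training data only; the test set is transformed with the training statistics
preprocess.fit(X_train)
X_test_prepared = preprocess.transform(X_test)

train_medians = preprocess.named_steps['impute'].statistics_
print("Medians learned from training data:", train_medians)
print("Prepared test shape:", X_test_prepared.shape)
```

Persisting the fitted `preprocess` object (for example with `joblib.dump`) is one way to reuse the same statistics at inference time. In this sketch the indicator columns are standardized along with the features; whether that is desirable depends on the downstream model.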

In [4]:
## How IterativeImputer Works

from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression, BayesianRidge
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Create example data to demonstrate the iterative process
np.random.seed(42)
n_samples = 100
X_complete = np.random.randn(n_samples, 4)
# Add some correlation between features
X_complete[:, 1] = X_complete[:, 0] + 0.5 * np.random.randn(n_samples)
X_complete[:, 2] = X_complete[:, 0] + X_complete[:, 1] + 0.3 * np.random.randn(n_samples)
X_complete[:, 3] = X_complete[:, 2] + 0.4 * np.random.randn(n_samples)

# Introduce missing values
X_missing = X_complete.copy()
missing_rate = 0.3
for col in range(4):
    missing_indices = np.random.choice(n_samples, int(n_samples * missing_rate), replace=False)
    X_missing[missing_indices, col] = np.nan

df_missing = pd.DataFrame(X_missing, columns=['Feature_A', 'Feature_B', 'Feature_C', 'Feature_D'])
print("Data with Missing Values (first 10 rows):")
print(df_missing.head(10))
print(f"\nMissing values per feature:")
print(df_missing.isnull().sum())

# Step-by-step demonstration of IterativeImputer
print("\n" + "="*60)
print("ITERATIVE IMPUTATION PROCESS")
print("="*60)

# Initialize with simple imputation (mean)
from sklearn.impute import SimpleImputer
initial_imputer = SimpleImputer(strategy='mean')
X_initial = initial_imputer.fit_transform(X_missing)

print("\nStep 1: Initial imputation using mean values")
print("Initial imputation values:", initial_imputer.statistics_)

# Show the iterative process manually
def demonstrate_iteration(X_current, X_missing, iteration_num, estimator=None):
    """Run one round-robin pass, starting from the current imputed matrix."""
    # Create the estimator here rather than sharing one default instance across calls
    if estimator is None:
        estimator = LinearRegression()
    print(f"\n--- Iteration {iteration_num} ---")
    X_iter = X_current.copy()

    for feature_idx in range(X_missing.shape[1]):
        # Find rows where this feature is missing
        missing_mask = np.isnan(X_missing[:, feature_idx])

        if np.any(missing_mask):
            # Use other features to predict this feature
            other_features = np.delete(np.arange(X_missing.shape[1]), feature_idx)

            # Training data: rows where current feature is NOT missing
            train_mask = ~missing_mask
            X_train = X_iter[train_mask][:, other_features]
            y_train = X_iter[train_mask, feature_idx]

            # Predict missing values
            if len(X_train) > 0:
                estimator.fit(X_train, y_train)
                X_predict = X_iter[missing_mask][:, other_features]
                predicted_values = estimator.predict(X_predict)

                print(f"Feature {feature_idx}: Predicted {len(predicted_values)} missing values")
                print(f"  Mean predicted value: {predicted_values.mean():.3f}")

                # Update the missing values
                X_iter[missing_mask, feature_idx] = predicted_values

    return X_iter

# Demonstrate two iterations manually; the second starts from the output of the first
X_iter1 = demonstrate_iteration(X_initial, X_missing, 1)
X_iter2 = demonstrate_iteration(X_iter1, X_missing, 2)

# Now use the actual IterativeImputer
print("\n" + "="*60)
print("USING SKLEARN ITERATIVEIMPUTER")
print("="*60)

# Default IterativeImputer (uses BayesianRidge)
imputer_default = IterativeImputer(random_state=42, max_iter=10, verbose=1)
X_imputed_default = imputer_default.fit_transform(X_missing)

print(f"\nDefault IterativeImputer completed in {imputer_default.n_iter_} iterations")

# IterativeImputer with different estimators
estimators = {
    'BayesianRidge': BayesianRidge(),
    'LinearRegression': LinearRegression(),
}

results = {}
for name, estimator in estimators.items():
    imputer = IterativeImputer(estimator=estimator, random_state=42, max_iter=5)
    X_imputed = imputer.fit_transform(X_missing)
    results[name] = X_imputed
    print(f"\n{name} - Iterations: {imputer.n_iter_}")

# Compare results
print("\n" + "="*60)
print("COMPARISON OF RESULTS")
print("="*60)

# Calculate imputation quality (where we know true values)
def calculate_imputation_error(X_true, X_missing, X_imputed):
    missing_mask = np.isnan(X_missing)
    true_values = X_true[missing_mask]
    imputed_values = X_imputed[missing_mask]
    mse = np.mean((true_values - imputed_values) ** 2)
    return mse

print("\nMean Squared Error for imputed values:")
for name, X_imputed in results.items():
    mse = calculate_imputation_error(X_complete, X_missing, X_imputed)
    print(f"{name}: {mse:.4f}")

# Simple imputation for comparison
X_simple = SimpleImputer(strategy='mean').fit_transform(X_missing)
mse_simple = calculate_imputation_error(X_complete, X_missing, X_simple)
print(f"Simple Mean Imputation: {mse_simple:.4f}")

Data with Missing Values (first 10 rows):
   Feature_A  Feature_B  Feature_C  Feature_D
0   0.496714        NaN        NaN        NaN
1  -0.234153        NaN        NaN  -0.564035
2        NaN  -0.466853  -1.355897        NaN
3   0.241962        NaN   0.676306        NaN
4  -1.012831  -1.237864  -2.445888  -2.280514
5   1.465649        NaN   3.096585   3.847303
6  -0.544383  -1.078193  -1.800294  -2.109810
7  -0.600639  -0.671828  -1.531664  -2.029526
8  -0.013497   0.046651   0.047710        NaN
9   0.208864   0.466083   0.425662   1.024079

Missing values per feature:
Feature_A    30
Feature_B    30
Feature_C    30
Feature_D    30
dtype: int64

ITERATIVE IMPUTATION PROCESS

Step 1: Initial imputation using mean values
Initial imputation values: [ 0.09580839 -0.00327686 -0.28212216  0.13757677]

--- Iteration 1 ---
Feature 0: Predicted 30 missing values
  Mean predicted value: -0.070
Feature 1: Predicted 30 missing values
  Mean predicted value: -0.074
Feature 2: Predicted 30 missing 

