# Baseline Model Evolution Report

### Initial Version for Submission (v2.0) - Enhanced Baseline

#### Data Processing Improvements
**Added:** Manual missing value handling with median imputation
- Replaced basic fillna() with manual median calculation and storage
- Ensures consistent preprocessing between train and test data
- `col_medians = []` list to store training medians for test preprocessing
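
The idea of fitting medians on the training data and reusing them on the test data can be sketched as follows. This is a minimal illustration, not the notebook's exact code: it stores the medians in a dict keyed by column name rather than a positional list, and the column names `ph` and `hardness` are hypothetical.

```python
import numpy as np
import pandas as pd

def fit_medians(train: pd.DataFrame) -> dict:
    """Compute per-column medians on the training frame only."""
    return {col: train[col].median() for col in train.columns}

def apply_medians(df: pd.DataFrame, medians: dict) -> pd.DataFrame:
    """Fill missing values with the stored training medians."""
    out = df.copy()
    for col, med in medians.items():
        if col in out.columns:
            out[col] = out[col].fillna(med)
    return out

# Hypothetical toy data: the column names are placeholders
train = pd.DataFrame({"ph": [6.0, np.nan, 8.0], "hardness": [100.0, 200.0, np.nan]})
test = pd.DataFrame({"ph": [np.nan], "hardness": [150.0]})

medians = fit_medians(train)
train_filled = apply_medians(train, medians)
test_filled = apply_medians(test, medians)  # uses training medians, not test medians
```

Keying by column name avoids relying on train and test columns arriving in the same order, which a positional list requires.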

**Added:** Feature selection pipeline
- `remove_correlated_features()` function to eliminate highly correlated features (>0.9 threshold)
- Reduces multicollinearity and potential overfitting
- Dynamic feature removal based on correlation matrix

**Added:** Manual Min-Max scaling implementation
- `manual_minmax_scale()` function replacing sklearn dependency
- Stores min/max values for consistent test data transformation
- Handles edge cases (division by zero when min equals max)
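
A minimal sketch of what such a scaler might look like is shown here. The function name matches the one mentioned above, but the signature and the choice to map constant columns to 0 are assumptions, since the original v2.0 implementation is not reproduced in this report.

```python
import numpy as np

def manual_minmax_scale(X, mins=None, maxs=None):
    """Scale columns to [0, 1] using stored (or freshly computed) min/max values."""
    X = np.asarray(X, dtype=float)
    if mins is None:
        mins = X.min(axis=0)
    if maxs is None:
        maxs = X.max(axis=0)
    span = maxs - mins
    # Avoid division by zero for constant columns (min == max)
    safe_span = np.where(span == 0, 1.0, span)
    scaled = (X - mins) / safe_span
    scaled[:, span == 0] = 0.0  # assumption: constant columns become all zeros
    return scaled, mins, maxs

X_train_demo = np.array([[1.0, 5.0], [3.0, 5.0], [5.0, 5.0]])
X_test_demo = np.array([[7.0, 5.0]])

scaled_train, mins, maxs = manual_minmax_scale(X_train_demo)
# Reuse the training min/max so the test data gets the same transformation
scaled_test, _, _ = manual_minmax_scale(X_test_demo, mins, maxs)
```

Test values outside the training range fall outside [0, 1], as in the demo above; whether to clip them is a separate design choice.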

#### Model Architecture Enhancements
**Improved:** SVM Implementation (`ImprovedSVM` class)
- Enhanced convergence checking with cost threshold
- Better gradient calculation handling for single samples
- Improved weight initialization and training stability
- Added proper label conversion to {-1, +1} for SVM
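
To illustrate the label conversion and cost-threshold convergence check, here is a small linear SVM trained by subgradient descent on the hinge loss. It is a sketch under assumptions: the function names, the hyperparameter defaults, and the exact cost formula are illustrative and are not taken from the `ImprovedSVM` class.

```python
import numpy as np

def to_svm_labels(y):
    """Map {0, 1} labels to {-1, +1} as the hinge loss expects."""
    return np.where(np.asarray(y) > 0, 1.0, -1.0)

def train_linear_svm(X, y, lr=0.01, C=1.0, max_iter=1000, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    y = to_svm_labels(y)
    w = rng.normal(scale=0.01, size=X.shape[1])  # small random initial weights
    b = 0.0
    prev_cost = np.inf
    for _ in range(max_iter):
        margins = y * (X @ w + b)
        cost = 0.5 * w @ w + C * np.maximum(0.0, 1.0 - margins).mean()
        # Convergence check: stop once the cost changes by less than tol
        if abs(prev_cost - cost) < tol:
            break
        prev_cost = cost
        active = margins < 1  # samples that violate the margin
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / len(y)
        grad_b = -C * y[active].sum() / len(y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 0, 1, 1])
w, b = train_linear_svm(X_demo, y_demo)
pred = (X_demo @ w + b > 0).astype(int)
```

Indexing with `y[active, None]` keeps the gradient well defined even when only one sample, or none, violates the margin.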

**Added:** Logistic Regression Implementation
- Custom `LogisticRegression` class with sigmoid activation
- Gradient descent optimization with clipping to prevent overflow
- Proper weight initialization for training stability
- Progress monitoring during training
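
The combination of sigmoid activation, overflow protection, and progress printing might look like the sketch below. Where exactly the original class applies clipping is not stated, so clipping the logits before `np.exp` is an assumption, as are the function names and defaults.

```python
import numpy as np

def sigmoid(z):
    # Clip the logits so np.exp never overflows for extreme inputs
    z = np.clip(z, -500.0, 500.0)
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iter=500, log_every=100):
    w = np.zeros(X.shape[1])
    b = 0.0
    eps = 1e-12  # keeps log() away from zero
    for it in range(1, n_iter + 1):
        p = sigmoid(X @ w + b)
        error = p - y
        # Gradient of the mean log-loss with respect to w and b
        w -= lr * (X.T @ error) / len(y)
        b -= lr * error.mean()
        if it % log_every == 0:
            loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
            print(f"iter {it:4d}  log-loss {loss:.4f}")
    return w, b

X_demo = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y_demo = np.array([0, 0, 1, 1])
w, b = train_logistic_regression(X_demo, y_demo)
pred = (sigmoid(X_demo @ w + b) > 0.5).astype(int)
```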

**Added:** Ensemble Model
- `EnsembleModel` class combining SVM and Logistic Regression
- Majority voting mechanism for final predictions
- Leverages strengths of both model types
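
Majority voting over hard predictions can be sketched as below. The `majority_vote` helper and its `tie_value` parameter are hypothetical; the report does not say how the `EnsembleModel` class resolves ties.

```python
import numpy as np

def majority_vote(predictions, tie_value=0):
    """Combine 0/1 predictions of shape (n_models, n_samples) by majority."""
    votes = np.asarray(predictions)
    n_models = votes.shape[0]
    positive = votes.sum(axis=0)          # number of models voting for class 1
    out = (2 * positive > n_models).astype(int)
    out[2 * positive == n_models] = tie_value  # explicit tie rule
    return out

svm_pred = np.array([1, 0, 1, 0])
logreg_pred = np.array([1, 1, 0, 0])

combined_low = majority_vote([svm_pred, logreg_pred], tie_value=0)
combined_high = majority_vote([svm_pred, logreg_pred], tie_value=1)
```

With only two voters, every disagreement is a tie, so the tie rule alone decides those samples. That may be one reason to consider a third model or weighted voting.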

#### Training and Evaluation Improvements
**Added:** Model comparison framework
- Dictionary-based model storage for easy iteration
- Automatic best model selection based on validation accuracy
- Comprehensive accuracy reporting for all models
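
A dictionary-based comparison loop could be as small as the sketch below. The `select_best` helper and the two stand-in predictors are hypothetical placeholders for the trained models.

```python
import numpy as np

def select_best(models, X_val, y_val):
    """Score each predictor on the validation set and return the best name."""
    scores = {
        name: float(np.mean(predict(X_val) == y_val))
        for name, predict in models.items()
    }
    best_name = max(scores, key=scores.get)
    return best_name, scores

X_val_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_val_demo = np.array([0, 0, 1, 1])

# Stand-in predictors: any callable mapping X to 0/1 labels fits here
models = {
    "threshold_rule": lambda X: (X[:, 0] > 1.5).astype(int),
    "always_one": lambda X: np.ones(len(X), dtype=int),
}

best_name, scores = select_best(models, X_val_demo, y_val_demo)
for name, acc in scores.items():
    print(f"{name}: {acc * 100:.1f}%")
```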

**Enhanced:** Hyperparameter considerations
- Configurable learning rates and regularization strengths
- Maximum iteration limits with early stopping
- Model-specific parameter tuning

#### Test Data Processing
**Improved:** Consistent preprocessing pipeline
- Applies same median imputation as training data
- Removes identical correlated features
- Applies identical scaling transformation
- Maintains feature alignment between train and test

#### Code Quality Improvements
**Added:** Better error handling and edge case management
- Numpy array shape consistency checks
- Overflow prevention in sigmoid calculations
- Robust gradient calculations

**Enhanced:** Documentation and logging
- Progress printing during model training
- Clear model performance reporting
- Step-by-step processing feedback

### Performance Impact
- **Baseline accuracy:** ~54.6% (estimated initial performance)
- **Current best model:** 56.1% (Logistic Regression)
- **Ensemble model:** 52.2% (below the baseline estimate, so the voting scheme needs further tuning)

### Key Technical Decisions
1. **Manual implementations** over sklearn to maintain full control and understanding
2. **Feature selection** to reduce dimensionality and improve generalization
3. **Ensemble approach** to combine different model strengths
4. **Consistent preprocessing** to ensure train/test data alignment

### Future Improvement Opportunities
- Hyperparameter optimization (grid search, random search)
- Additional feature engineering techniques
- More sophisticated ensemble methods (weighted voting, stacking)
- Cross-validation for more robust model selection
- Advanced regularization techniques

### Dependencies Managed
- Minimal external dependencies (numpy, pandas, plus scikit-learn's `train_test_split` for the validation split)
- Custom implementations for core ML algorithms
- Reproducible preprocessing pipeline

---

### Version 2.1 - Advanced Preprocessing and Model Tuning

#### Reasoning and Methods
To address the limitations of the initial preprocessing pipeline, advanced techniques were introduced to better handle feature scaling and non-linear relationships. Standardization (z-score) was implemented to ensure features had zero mean and unit variance, improving model convergence. Polynomial feature expansion (degree 2) was added to capture non-linear interactions, and log transformations were applied to reduce the impact of skewed features and outliers.
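
The log transformation step is not shown in the final code, so here is a hedged sketch of one way it could work. The skewness threshold of 1.0, the restriction to non-negative columns, and the use of `np.log1p` are all assumptions.

```python
import numpy as np
import pandas as pd

def log_transform_skewed(df: pd.DataFrame, skew_threshold: float = 1.0):
    """Apply log1p to non-negative columns whose skewness exceeds the threshold."""
    out = df.copy()
    transformed = []
    for col in df.columns:
        if df[col].min() >= 0 and abs(df[col].skew()) > skew_threshold:
            out[col] = np.log1p(df[col])  # log(1 + x) is defined at x = 0
            transformed.append(col)
    return out, transformed

demo = pd.DataFrame({
    "skewed": [1.0, 1.0, 1.0, 1.0, 100.0],  # one large outlier
    "even": [1.0, 2.0, 3.0, 4.0, 5.0],      # symmetric, left untouched
})
demo_log, transformed_cols = log_transform_skewed(demo)
```

The list of transformed columns would need to be stored and reapplied to the test data, in the same way as the medians and scaling statistics.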

#### Expectations
These enhancements were expected to improve the performance of models sensitive to feature scaling and non-linearities, such as Logistic Regression and SVM.

#### Outcomes
- Logistic Regression achieved 57.3% validation accuracy, demonstrating improved performance with standardized and polynomially expanded features.
- SVM achieved 54.6% validation accuracy with log-transformed features and z-score normalization, matching the ~54.6% baseline estimate.
- Ensemble model achieved 53.1% validation accuracy, up from 52.2% in v2.0 but still below both individual models.

#### Reflections
The advanced preprocessing produced modest gains: Logistic Regression rose from 56.1% to 57.3% and the ensemble from 52.2% to 53.1%. Logistic Regression remained the best-performing model in this version, which supports the value of feature scaling and transformation for this model.

---

### Version 3.0 - Comprehensive ML Pipeline

#### Reasoning and Methods
This version focused on creating a comprehensive pipeline with advanced feature engineering and systematic hyperparameter tuning. Interaction terms, binning, and k-means clustering were introduced to capture complex relationships. Power transforms (sqrt, square) were applied to enhance feature distributions, and low variance feature removal was used to reduce noise.
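
Two of these steps, power transforms and a k-means cluster feature, are sketched below in plain NumPy. The helper names are hypothetical, the k-means is a basic Lloyd's algorithm, and taking `np.abs` before the square root is an assumption to keep the transform defined for standardized (possibly negative) inputs.

```python
import numpy as np

def kmeans_labels(X, k=2, n_iter=20, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per row."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:  # keep the old center if a cluster empties
                centers[c] = members.mean(axis=0)
    # Final assignment against the last set of centers
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1)

def add_engineered_features(X, k=2):
    """Append sqrt and square transforms plus a cluster label column."""
    labels = kmeans_labels(X, k=k)
    return np.column_stack([X, np.sqrt(np.abs(X)), X ** 2, labels])

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
features = add_engineered_features(X_demo)
cluster_col = features[:, -1]
```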

#### Expectations
These methods were expected to improve model interpretability and performance by creating more informative features and reducing noise.

#### Outcomes
- Logistic Regression with L2 regularization showed enhanced generalization and reduced overfitting.
- RBF SVM achieved faster training and improved accuracy.
- Weighted voting in ensemble methods combined model predictions effectively, boosting overall performance.

#### Reflections
The systematic approach to feature engineering and hyperparameter tuning significantly boosted model performance. The pipeline demonstrated the importance of preprocessing variants and model selection.

---

### Version 4.0 - PRBF Hybrid Kernel SVM

#### Reasoning and Methods
A novel PRBF (Polynomial-RBF) hybrid kernel SVM was implemented to combine the strengths of RBF and Polynomial kernels. The hybrid kernel formula allowed for a tunable mixing ratio, providing flexibility in capturing complex patterns.
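
The report does not give the hybrid kernel formula, so the sketch below assumes one common construction: a convex combination of an RBF kernel and a polynomial kernel, with `alpha` as the mixing ratio. The function names and default parameters are illustrative only.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """exp(-gamma * ||a - b||^2) for every pair of rows."""
    sq_dists = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def poly_kernel(A, B, degree=2, coef0=1.0):
    """(a . b + coef0)^degree for every pair of rows."""
    return (A @ B.T + coef0) ** degree

def prbf_kernel(A, B, alpha=0.5, gamma=0.5, degree=2, coef0=1.0):
    """Assumed hybrid: alpha * RBF + (1 - alpha) * polynomial, with 0 <= alpha <= 1."""
    return alpha * rbf_kernel(A, B, gamma) + (1 - alpha) * poly_kernel(A, B, degree, coef0)

A_demo = np.array([[0.0, 0.0], [1.0, 0.0]])
K = prbf_kernel(A_demo, A_demo, alpha=0.5)
K_rbf_only = prbf_kernel(A_demo, A_demo, alpha=1.0)
```

A convex combination of two valid kernels is itself a valid (positive semi-definite) kernel, which is one reason this form is a natural candidate. Setting `alpha` to 1 or 0 recovers the pure RBF or pure polynomial kernel.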

#### Expectations
The hybrid kernel was expected to outperform individual kernels by leveraging their complementary strengths.

#### Outcomes
- Achieved 71% validation accuracy, demonstrating the effectiveness of hybrid kernels.
- Highlighted the potential of combining kernel methods for complex datasets.

#### Reflections
The PRBF hybrid kernel validated the hypothesis that combining kernels can enhance model performance. However, the complexity of tuning multiple parameters posed challenges.

---

### Version 5.0 - RandomForest and PRBF Hybrid

#### Reasoning and Methods
This version introduced a custom RandomForest with early stopping and OOB validation. Bootstrap sampling and feature subsampling were used to reduce variance, while early stopping prevented overfitting. The PRBF Kernel SVM was further optimized for better performance.

#### Expectations
RandomForest was expected to provide robust performance due to its ensemble nature, while the optimized PRBF Kernel SVM aimed to improve accuracy further.

#### Outcomes
- RandomForest achieved 74.75% validation accuracy, outperforming other models.
- PRBF Kernel SVM showed marginal improvements but was overshadowed by RandomForest.

#### Reflections
RandomForest emerged as the primary focus due to its superior performance and robustness. The ensemble approach proved effective in handling complex datasets.

---

### Version 6.0 - Optimized RandomForest

#### Reasoning and Methods
Underperforming models with accuracy below 72% were removed to streamline the implementation. The focus shifted entirely to optimizing RandomForest, with enhanced preprocessing tailored to its strengths.

#### Expectations
By concentrating on a single high-performing model, further improvements in accuracy and efficiency were anticipated.

#### Outcomes
- RandomForest achieved 74.75% validation accuracy, maintaining its position as the leading model.

#### Reflections
Focusing on RandomForest simplified the implementation, while validation accuracy held at the 74.75% reached in v5.0. This version highlighted the value of prioritizing high-performing models.

---

### Version 7.0 - Enhanced RandomForest

#### Reasoning and Methods
The RandomForest's OOB validation, early stopping, and feature subsampling, first introduced in v5.0, were refined in this version. These enhancements aimed to further boost accuracy and reduce overfitting.

#### Expectations
The enhanced RandomForest was expected to achieve the highest accuracy in the project, solidifying its position as the champion model.

#### Outcomes
- Achieved 76.25% validation accuracy, marking the highest performance in the project.

#### Reflections
The enhanced RandomForest validated the effectiveness of ensemble methods and the added features. This version finalized the model as production-ready, improving validation accuracy by 1.5 percentage points over v6.0 (74.75% to 76.25%).


In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load dataset
df = pd.read_csv('/kaggle/input/mldl-2025/train.csv')
if 'ID' in df.columns:
    df = df.drop(columns=['ID'])

X = df.drop(columns=['Y'])
y = df['Y'].values

# Handle missing values with median
col_medians = []
for col in X.columns:
    median = X[col].median()
    col_medians.append(median)
    X[col] = X[col].fillna(median)

def remove_correlated_features(X):
    corr = X.corr()
    drop_columns = []
    for i in range(len(corr.columns)):
        for j in range(i + 1, len(corr.columns)):
            if abs(corr.iloc[i, j]) >= 0.85:  
                drop_columns.append(corr.columns[j])
    drop_columns = list(set(drop_columns))
    X.drop(drop_columns, axis=1, inplace=True)
    return drop_columns

def manual_standardize(X):
    X_std = X.copy()
    means = X.mean()
    stds = X.std()
    for col in X.columns:
        if stds[col] != 0:
            X_std[col] = (X[col] - means[col]) / stds[col]
        else:
            X_std[col] = 0
    return X_std, means, stds

def polynomial_features(X):
    X_poly = X.copy()
    cols = X.columns
    new_features = {}
    for i in range(len(cols)):
        for j in range(i, len(cols)):
            new_col = f"{cols[i]}*{cols[j]}"
            new_features[new_col] = X[cols[i]] * X[cols[j]]
    X_poly = pd.concat([X_poly, pd.DataFrame(new_features, index=X.index)], axis=1)
    return X_poly

def remove_low_variance_features(X, threshold=0.005): 
    variances = X.var()
    low_var_cols = variances[variances < threshold].index
    return X.drop(columns=low_var_cols), low_var_cols

# Feature selection
corr_dropped = remove_correlated_features(X)
X, low_var_dropped = remove_low_variance_features(X)

# Best preprocessing: polynomial + standardization
X_poly = polynomial_features(X)
X_poly_std, X_poly_means, X_poly_stds = manual_standardize(X_poly)
X_poly_std = pd.concat([X_poly_std, pd.DataFrame({'intercept': 1}, index=X_poly_std.index)], axis=1)

# Train/validation split (25% held out for validation)
X_train, X_val, y_train, y_val = train_test_split(X_poly_std.values, y, test_size=0.25, random_state=42)

In [None]:
# Enhanced RandomForest implementation with OOB validation and early stopping
class Model:
    def __init__(self):
        self.n_estimators = 250          
        self.max_depth = 15             
        self.min_samples_split = 5      
        self.min_samples_leaf = 3       
        self.max_features = 'log2'      
        self.bootstrap = True
        self.trees = []
        self.oob_indices = []
        self.patience = 15              
        self.best_oob = -1
        self.no_improve = 0

    def _gini_impurity(self, y):
        if len(y) == 0:
            return 0
        p = np.sum(y == 1) / len(y)
        return 2 * p * (1 - p)

    def _information_gain(self, y, left_y, right_y):
        n = len(y)
        if n == 0:
            return 0
        n_left = len(left_y)
        n_right = len(right_y)
        
        parent_gini = self._gini_impurity(y)
        left_gini = self._gini_impurity(left_y)
        right_gini = self._gini_impurity(right_y)
        
        weighted_gini = (n_left / n) * left_gini + (n_right / n) * right_gini
        return parent_gini - weighted_gini

    def _build_tree(self, X, y, depth=0):
        n_samples, n_features = X.shape
        
        if (depth >= self.max_depth or
            n_samples < self.min_samples_split or
            len(np.unique(y)) == 1):
            return {'leaf': True, 'prediction': np.round(np.mean(y))}
        
        # Feature selection method
        if self.max_features == 'sqrt':
            max_features = int(np.sqrt(n_features))
        elif self.max_features == 'log2':
            max_features = max(1, int(np.log2(n_features)))
        else:
            max_features = n_features
        
        feature_indices = np.random.choice(n_features, max_features, replace=False)
        
        best_gain = -1
        best_feature = None
        best_threshold = None
        
        for feature_idx in feature_indices:
            thresholds = np.percentile(X[:, feature_idx], [20, 35, 50, 65, 80]) 
            
            for threshold in thresholds:
                left_mask = X[:, feature_idx] <= threshold
                right_mask = ~left_mask
                
                if (np.sum(left_mask) < self.min_samples_leaf or 
                    np.sum(right_mask) < self.min_samples_leaf):
                    continue
                
                gain = self._information_gain(y, y[left_mask], y[right_mask])
                
                if gain > best_gain:
                    best_gain = gain
                    best_feature = feature_idx
                    best_threshold = threshold
        
        if best_feature is None:
            return {'leaf': True, 'prediction': np.round(np.mean(y))}
        
        left_mask = X[:, best_feature] <= best_threshold
        right_mask = ~left_mask
        
        left_tree = self._build_tree(X[left_mask], y[left_mask], depth + 1)
        right_tree = self._build_tree(X[right_mask], y[right_mask], depth + 1)
        
        return {
            'leaf': False,
            'feature': best_feature,
            'threshold': best_threshold,
            'left': left_tree,
            'right': right_tree
        }

    def _predict_tree(self, tree, X):
        if tree['leaf']:
            return np.full(len(X), tree['prediction'])
        
        predictions = np.zeros(len(X))
        left_mask = X[:, tree['feature']] <= tree['threshold']
        right_mask = ~left_mask
        
        if np.sum(left_mask) > 0:
            predictions[left_mask] = self._predict_tree(tree['left'], X[left_mask])
        if np.sum(right_mask) > 0:
            predictions[right_mask] = self._predict_tree(tree['right'], X[right_mask])
        
        return predictions

    def fit(self, X, y):
        n_samples = X.shape[0]
        print(f"- n_estimators: {self.n_estimators}")
        print(f"- max_depth: {self.max_depth}")
        print(f"- min_samples_split: {self.min_samples_split}")
        print(f"- min_samples_leaf: {self.min_samples_leaf}")
        print(f"- max_features: {self.max_features}")
        print(f"- patience: {self.patience}")
        
        for i in range(self.n_estimators):
            if self.bootstrap:
                indices = np.random.choice(n_samples, n_samples, replace=True)
                X_bootstrap = X[indices]
                y_bootstrap = y[indices]
                oob_idx = np.setdiff1d(np.arange(n_samples), indices)
                self.oob_indices.append(oob_idx)
            else:
                X_bootstrap = X
                y_bootstrap = y
                self.oob_indices.append(np.arange(n_samples))
            
            tree = self._build_tree(X_bootstrap, y_bootstrap)
            self.trees.append(tree)
            
            # Check OOB accuracy every 5 trees
            if (i + 1) % 5 == 0:
                oob_pred = self._get_oob_predictions(X, i + 1)
                oob_acc = np.mean(oob_pred == y)
                print(f"[{i + 1:3d}] OOB Accuracy: {oob_acc*100:.2f}%")
                
                if oob_acc > self.best_oob + 1e-6:
                    self.best_oob = oob_acc
                    self.no_improve = 0
                else:
                    self.no_improve += 1
                
                if self.no_improve >= self.patience:
                    print(f"Early stopping at tree {i + 1}")
                    break

    def _get_oob_predictions(self, X, n_trees):
        n_samples = X.shape[0]
        oob_votes = np.zeros(n_samples)
        oob_counts = np.zeros(n_samples)
        
        for t in range(n_trees):
            idx = self.oob_indices[t]
            if idx.size == 0:
                continue
            
            preds = self._predict_tree(self.trees[t], X[idx])
            oob_votes[idx] += preds
            oob_counts[idx] += 1
        
        mask = oob_counts > 0
        oob_final = np.zeros(n_samples, dtype=int)
        oob_final[mask] = (oob_votes[mask] / oob_counts[mask] > 0.5).astype(int)
        
        return oob_final

    def predict_proba(self, X):
        n_samples = X.shape[0]
        predictions = np.zeros(n_samples)
        
        for tree in self.trees:
            predictions += self._predict_tree(tree, X)
        
        return predictions / len(self.trees)

    def predict(self, X):
        return (self.predict_proba(X) > 0.5).astype(int)

In [None]:
# Train model
model = Model()
model.fit(X_train, y_train)
preds = model.predict(X_val)
accuracy = np.mean(preds == y_val)
print(f"\nFinal Validation Accuracy: {accuracy*100:.2f}%")

In [None]:
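# Load and preprocess test data
test_df = pd.read_csv('/kaggle/input/mldl-2025/test.csv')

# Mirror the training step: keep IDs for the submission but exclude them from the features
if 'ID' in test_df.columns:
    test_ids = test_df['ID']
    X_test = test_df.drop(columns=['ID'])
else:
    test_ids = test_df.index
    X_test = test_df.copy()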

# Handle missing values using training medians
for i, col in enumerate(X_test.columns):
    X_test[col] = X_test[col].fillna(col_medians[i])

# Remove same features as training
X_test.drop(corr_dropped, axis=1, inplace=True, errors='ignore')
X_test.drop(low_var_dropped, axis=1, inplace=True, errors='ignore')

# Apply polynomial features then standardize
X_test_transformed = polynomial_features(X_test)
for col in X_poly_means.index:
    if col in X_test_transformed.columns:
        if X_poly_stds[col] != 0:
            X_test_transformed[col] = (X_test_transformed[col] - X_poly_means[col]) / X_poly_stds[col]
        else:
            X_test_transformed[col] = 0

X_test_transformed = pd.concat([X_test_transformed, pd.DataFrame({'intercept': 1}, index=X_test_transformed.index)], axis=1)

# Generate predictions
preds = model.predict(X_test_transformed.values)

# Save submission
submission_df = pd.DataFrame({
    'ID': test_ids,
    'Potability': preds
})
submission_df.to_csv('submission.csv', index=False)
print("Predictions saved to submission.csv")