## DNN
## Check for GPU Availability

Confirm that TensorFlow detects a GPU. Run this cell in a GPU-enabled environment (e.g., Kaggle Kernels with GPU or a local GPU machine).


In [1]:
# Check GPU availability
import tensorflow as tf
print("GPUs Available:", tf.config.list_physical_devices('GPU'))

GPUs Available: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


## Import Libraries

We import the necessary libraries, including scikeras for wrapping our Keras model and `SimpleImputer` for missing-value imputation. GPU-enabled training will be used if available.

In [2]:
# Import libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os, time
import seaborn as sns
import tensorflow as tf

from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler
from sklearn.pipeline import Pipeline

# We'll use scikeras for wrapping Keras models
from scikeras.wrappers import KerasClassifier

from tensorflow.keras.layers import Dense, Dropout
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam

import warnings
warnings.filterwarnings("ignore")

%matplotlib inline

## Memory Optimization Function

This function attempts to reduce memory usage by downcasting numeric columns to smaller data types (e.g., float64 -> float32, int64 -> the smallest integer type that can hold the column's values).


In [3]:
# Define memory optimization function
def reduce_memory_usage(df, verbose=True):
    """Downcasts numeric columns to reduce memory usage."""
    start_mem = df.memory_usage().sum() / 1024**2
    for col in df.columns:
        col_type = df[col].dtypes
        if col_type == 'float64':
            df[col] = pd.to_numeric(df[col], downcast='float')
        elif col_type == 'int64':
            df[col] = pd.to_numeric(df[col], downcast='integer')
    end_mem = df.memory_usage().sum() / 1024**2
    if verbose:
        print(f"Memory usage reduced from {start_mem:.2f} MB to {end_mem:.2f} MB "
              f"({100 * (start_mem - end_mem) / start_mem:.1f}% reduction)")
    return df
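
To make the downcasting behaviour concrete, here is a minimal sketch on a toy DataFrame. The column names and values are illustrative assumptions, not taken from the competition data. It shows that `pd.to_numeric(..., downcast='integer')` picks the smallest integer type that fits (which can be smaller than int32), and that a column with large values is left unchanged.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical columns) with 64-bit numeric dtypes.
toy = pd.DataFrame({
    "small_int": np.array([1, 2, 3], dtype="int64"),
    "big_int": np.array([1, 2, 3_000_000_000], dtype="int64"),
    "value": np.array([0.5, 1.5, 2.5], dtype="float64"),
})

downcast = toy.copy()
# Integers go to the smallest integer dtype that can hold every value.
downcast["small_int"] = pd.to_numeric(toy["small_int"], downcast="integer")
downcast["big_int"] = pd.to_numeric(toy["big_int"], downcast="integer")
# Floats go to float32 at the smallest.
downcast["value"] = pd.to_numeric(toy["value"], downcast="float")

print(downcast.dtypes)
```

Here `small_int` becomes int8, `big_int` stays int64 because 3,000,000,000 does not fit in int32, and `value` becomes float32.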

## Load and Optimize Data

We load our train/test data, apply the memory optimization function, and separate features/target. Adjust file names as needed for your environment.


In [4]:
# Load and optimize data
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')

print("Before memory optimization:")
print("Train shape:", train.shape)
print("Test shape:", test.shape)

train = reduce_memory_usage(train)
test = reduce_memory_usage(test)

# Separate features and target from training data
X = train.drop(columns=['id', 'rainfall'])
y = train['rainfall']

# For test data, drop the id column and save the IDs for submission
X_test = test.drop(columns=['id'])
test_ids = test['id']

print("After optimization:")
print("X shape:", X.shape)
print("X_test shape:", X_test.shape)

Before memory optimization:
Train shape: (2190, 13)
Test shape: (730, 12)
Memory usage reduced from 0.22 MB to 0.09 MB (56.7% reduction)
Memory usage reduced from 0.07 MB to 0.03 MB (54.1% reduction)
After optimization:
X shape: (2190, 11)
X_test shape: (730, 11)


## Define Preprocessing Pipelines

We create a dictionary of preprocessing options to compare: standard, minmax, and robust scaling, plus `raw` (no scaling). You can remove entries or add more if you prefer.


In [5]:
# Define preprocessing pipelines
preprocessors = {
    'standard': StandardScaler(),
    'minmax': MinMaxScaler(),
    'robust': RobustScaler(),
    'raw': None
}
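
As a rough illustration of how the three scalers differ, the sketch below applies each one to a single toy column containing an outlier. The values are made up for illustration and are not from the dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Toy column with one outlier (illustrative values only).
column = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

demo_scalers = {
    "standard": StandardScaler(),   # zero mean, unit variance
    "minmax": MinMaxScaler(),       # rescales to [0, 1]
    "robust": RobustScaler(),       # centers on the median, scales by the IQR
}

scaled = {}
for name, scaler in demo_scalers.items():
    scaled[name] = scaler.fit_transform(column).ravel()
    print(f"{name:>8}: {np.round(scaled[name], 2)}")
```

With min-max scaling the outlier squeezes the other four values close to 0, while robust scaling keeps them spread around the median. That difference is one reason to compare scalers empirically rather than pick one up front.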

## Define Keras Model Function

We define a function that builds a DNN model. This will be used by the `KerasClassifier` wrapper. We'll allow hyperparameters like `optimizer`, `dropout_rate`, and `learning_rate` to be tuned.


In [6]:
# Define the Keras model function
def create_dnn_model(optimizer='adam', dropout_rate=0.2, learning_rate=0.001):
    model = Sequential()
    model.add(Dense(64, activation='relu', input_shape=(X.shape[1],)))
    model.add(Dropout(dropout_rate))
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(dropout_rate))
    model.add(Dense(1, activation='sigmoid'))
    
    # Build the optimizer with the specified learning rate.
    # Unrecognized optimizer names fall back to Adam.
    if optimizer == 'rmsprop':
        opt = tf.keras.optimizers.RMSprop(learning_rate=learning_rate)
    else:
        opt = Adam(learning_rate=learning_rate)
    
    model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['AUC'])
    return model

## Define KerasClassifier and Hyperparameter Grid

We wrap our model with the `KerasClassifier` wrapper from scikeras, which integrates with scikit-learn, and set up a parameter grid (or distribution) for random search. Adjust the ranges to fit your problem constraints.

**Note:** Hyperparameters of the model-building function must use the prefix `model__`. Because the pipeline step is named `'dnn'`, the full prefix is `'dnn__model__'`.


In [7]:
# Define the KerasClassifier wrapper and hyperparameter grid
dnn_wrapper = KerasClassifier(
    model=create_dnn_model,
    epochs=20,           # Default epochs; will be tuned further
    batch_size=32,       # Default batch_size; will be tuned further
    verbose=1
)

# Define hyperparameter grid (prefix "dnn__model__" routes values to create_dnn_model arguments)
param_dist = {
    'dnn__model__optimizer': ['adam', 'rmsprop'],
    'dnn__model__dropout_rate': [0.2, 0.3],
    'dnn__model__learning_rate': [1e-3, 1e-4],
    'dnn__epochs': [20, 30],
    'dnn__batch_size': [32, 64]
}
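
Since every entry in `param_dist` is a finite list, the number of distinct candidates is bounded. The sketch below counts them with scikit-learn's `ParameterGrid`, using a copy of the grid above. When all parameters are given as lists, `RandomizedSearchCV` samples without replacement, so it cannot evaluate more candidates than this count regardless of `n_iter`.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the search space defined above.
param_dist_copy = {
    'dnn__model__optimizer': ['adam', 'rmsprop'],
    'dnn__model__dropout_rate': [0.2, 0.3],
    'dnn__model__learning_rate': [1e-3, 1e-4],
    'dnn__epochs': [20, 30],
    'dnn__batch_size': [32, 64],
}

# Five parameters with two values each: 2**5 combinations.
n_candidates = len(ParameterGrid(param_dist_copy))
n_splits = 5  # matches the StratifiedKFold used in the search

print("Distinct candidates:", n_candidates)
print("Total fits with 5-fold CV:", n_candidates * n_splits)
```

This accounts for the "32 candidates, totalling 160 fits" line in the search log, even though the search requests `n_iter=50`.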

## Randomized Search for Each Preprocessing Pipeline

We iterate through each scaler, create a pipeline that includes scaling + the DNN, and run `RandomizedSearchCV` to find the best hyperparameters. We collect the results in a list for comparison.

In [8]:
# Run hyperparameter tuning and generate submissions
results = []  # List to store results for later comparison
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Create a directory for submission files
os.makedirs("submissions", exist_ok=True)

for prep_name, scaler in preprocessors.items():
    print(f"\n--- Preprocessing: {prep_name} ---")
    
    # Build the pipeline: always include an imputer to handle missing values,
    # then add the scaler if specified, and finally the DNN.
    steps = []
    steps.append(('imputer', SimpleImputer(strategy="median")))
    if scaler is not None:
        steps.append((prep_name, scaler))
    steps.append(('dnn', dnn_wrapper))
    pipeline = Pipeline(steps)
    
    # Perform RandomizedSearchCV. n_iter=50 exceeds the 32 combinations in
    # param_dist, so all 32 candidates are evaluated.
    search = RandomizedSearchCV(
        estimator=pipeline,
        param_distributions=param_dist,
        n_iter=50,
        scoring='roc_auc',
        cv=cv,
        verbose=1,
        random_state=42,
        n_jobs=-1
    )
    
    search.fit(X, y)
    
    best_score = search.best_score_
    best_params = search.best_params_
    
    print(f"Best CV ROC AUC for {prep_name}: {best_score:.4f}")
    print("Best parameters:", best_params)
    
    # Generate test predictions using the best estimator from the pipeline.
    # The imputer and scaler steps will automatically transform X_test.
    test_preds = search.best_estimator_.predict_proba(X_test)[:, 1]
    
    # Save the submission file for this preprocessing method.
    sub_filename = f"submissions/{prep_name}_dnn_submission.csv"
    sub_df = pd.DataFrame({'id': test_ids, 'rainfall': test_preds})
    sub_df.to_csv(sub_filename, index=False)
    print(f"Submission file saved: {sub_filename}\n")
    
    # Record results for later comparison.
    results.append({
        'preprocessing': prep_name,
        'best_auc': best_score,
        'best_params': best_params,
        'submission_file': sub_filename
    })


--- Preprocessing: standard ---
Fitting 5 folds for each of 32 candidates, totalling 160 fits


KeyboardInterrupt: 
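
Because the full search is slow, it can help to check the pipeline and search wiring on something cheap first. The sketch below is a scaled-down stand-in, not the notebook's actual run: it swaps the DNN for `LogisticRegression`, uses synthetic data from `make_classification`, and tunes a single hypothetical parameter (`clf__C`). The imputer, scaler, step-prefixed parameter names, `roc_auc` scoring, and stratified folds mirror the loop above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary-classification data with a few injected missing values.
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_demo[::25, 0] = np.nan

# Same step order as the notebook: imputer -> scaler -> estimator.
demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("standard", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),  # stand-in for the DNN step
])

demo_search = RandomizedSearchCV(
    estimator=demo_pipeline,
    param_distributions={"clf__C": [0.01, 0.1, 1.0, 10.0]},  # "<step>__<param>"
    n_iter=4,
    scoring="roc_auc",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
demo_search.fit(X_demo, y_demo)

# The refit best estimator applies imputation and scaling before predicting.
demo_probs = demo_search.best_estimator_.predict_proba(X_demo)[:, 1]
print(demo_search.best_params_, round(demo_search.best_score_, 3))
```

If this runs cleanly, the same structure with the `'dnn'` step and `param_dist` should behave the same way, only more slowly.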

## Compare Results

We create a DataFrame from the results and sort by the best AUC score to see which preprocessing pipeline + hyperparameters performed best.


In [None]:
# Compare results across different preprocessing pipelines
results_df = pd.DataFrame(results).sort_values(by='best_auc', ascending=False)
print("\nComparison of Preprocessing Approaches:")
display(results_df)
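
If the search loop is interrupted before any pipeline finishes, `results` is an empty list and sorting by `best_auc` raises a `KeyError`. One possible guard, sketched below, fixes the expected columns up front. `summarize_results` and `RESULT_COLUMNS` are hypothetical names, and the AUC values in the usage example are placeholders, not measured results.

```python
import pandas as pd

# Keys recorded for each pipeline in the search loop.
RESULT_COLUMNS = ["preprocessing", "best_auc", "best_params", "submission_file"]

def summarize_results(results):
    """Return results sorted by best_auc; an empty frame if nothing was recorded."""
    frame = pd.DataFrame(results, columns=RESULT_COLUMNS)
    return frame.sort_values(by="best_auc", ascending=False).reset_index(drop=True)

# Works even when no search completed.
empty_summary = summarize_results([])

# Placeholder rows (made-up scores) to show the ordering.
placeholder = [
    {"preprocessing": "a", "best_auc": 0.1, "best_params": {}, "submission_file": "a.csv"},
    {"preprocessing": "b", "best_auc": 0.9, "best_params": {}, "submission_file": "b.csv"},
]
sorted_summary = summarize_results(placeholder)
print(sorted_summary[["preprocessing", "best_auc"]])
```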

## Train Final Model & Generate Predictions

You may choose to train a final model using the best pipeline and parameters, then generate predictions for your test set. The cell below retrains a model for every row of `results_df` and writes one submission file per preprocessing method.


In [None]:
# Retrain final models using best parameters and generate submission files
for idx, row in results_df.iterrows():
    prep = row['preprocessing']
    best_params = row['best_params']
    print(f"Retraining final model with preprocessing: {prep}")
    
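    # Rebuild the same steps used during the search: imputer, optional scaler, DNN.
    steps = [('imputer', SimpleImputer(strategy="median"))]
    if preprocessors.get(prep) is not None:
        steps.append((prep, preprocessors[prep]))
    steps.append(('dnn', dnn_wrapper))
    final_pipeline = Pipeline(steps)
    # best_params keys carry the 'dnn__' step prefix, so set them on the pipeline.
    final_pipeline.set_params(**best_params)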
    
    final_pipeline.fit(X, y)
    final_preds = final_pipeline.predict_proba(X_test)[:, 1]
    
    final_sub_filename = f"submissions/final_{prep}_dnn_submission.csv"
    final_sub_df = pd.DataFrame({'id': test_ids, 'rainfall': final_preds})
    final_sub_df.to_csv(final_sub_filename, index=False)
    print(f"Final submission saved: {final_sub_filename}\n")
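
The keys in `best_params` include the pipeline step name (for example `dnn__epochs`), so they are addressed to a `Pipeline`, not to the estimator inside it. The sketch below illustrates the difference with a stand-in `LogisticRegression` step named `'clf'`. The step name and the value of `C` are assumptions for illustration only.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Step-prefixed parameters, in the same shape as search.best_params_.
best = {"clf__C": 10.0}

# Setting them on the pipeline routes the value to the 'clf' step.
pipe = Pipeline([("clf", LogisticRegression())])
pipe.set_params(**best)
print("Pipeline step C:", pipe.get_params()["clf__C"])

# Setting them on the bare estimator fails: it has no parameter named 'clf'.
try:
    LogisticRegression().set_params(**best)
    bare_ok = True
except ValueError as err:
    bare_ok = False
    print("Bare estimator rejected the prefixed key:", type(err).__name__)
```

The same holds for the scikeras wrapper: apply prefixed parameters with `set_params` on the pipeline, or strip the `dnn__` prefix before passing them to the wrapper directly.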

## Conclusion

1. **Memory Optimization**: We used a simple function to downcast numerical columns, reducing the dataset's memory footprint.
2. **Preprocessing Comparison**: We compared several scalers (Standard, MinMax, Robust) and unscaled input in a pipeline with a DNN.
3. **Hyperparameter Tuning**: We used `RandomizedSearchCV` to explore optimizers, dropout rates, learning rates, epochs, and batch sizes.
4. **Results**: We displayed each pipeline’s best ROC AUC score and hyperparameters, then retrained a final model for submission.

This approach can be extended by:
- Adding more preprocessing techniques (e.g., PCA, custom transformations).
- Tuning additional hyperparameters (e.g., network architecture) or widening the ranges for batch size and epochs.
- Using more folds or a larger search space with a larger `n_iter` for `RandomizedSearchCV` for better hyperparameter exploration.
