## Data Loading

The first step is to load the data into a format suitable for processing. For demonstration, the function below generates a synthetic dataset with numerical and categorical features and an imbalanced binary target variable.

In [1]:
import pandas as pd
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

def generate_synthetic_data(n_samples=1000, n_features=20):
    """
    Generates a synthetic dataset for binary classification.

    Args:
        n_samples: The number of samples to generate.
        n_features: The number of features to generate.

    Returns:
        A Pandas DataFrame containing the synthetic dataset.
    """
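    # Split the features 3:1 between informative and redundant (15 and 5
    # for the default n_features=20) so that other sizes remain valid.
    n_informative = (3 * n_features) // 4
    n_redundant = n_features - n_informative

    # Generate the synthetic dataset
    X, y = make_classification(
        n_samples=n_samples,
        n_features=n_features,
        n_informative=n_informative,
        n_redundant=n_redundant,
        random_state=42,
        n_classes=2,
        weights=[0.8, 0.2],  # Imbalanced classes (about 80% / 20%)
        flip_y=0.05,         # Randomly reassign 5% of the labels as noise
    )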

    # Convert to Pandas DataFrame
    df = pd.DataFrame(X, columns=[f'Feature_{i}' for i in range(1, n_features + 1)])
    df['Target'] = y

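    # Add categorical features. These assignments overwrite three of the
    # numeric columns, and a seeded generator keeps them reproducible.
    rng = np.random.default_rng(42)
    df['Feature_11'] = rng.choice(['A', 'B', 'C'], size=n_samples)
    df['Feature_12'] = rng.choice(['X', 'Y', 'Z', 'W'], size=n_samples)
    df['Feature_13'] = rng.choice(['P', 'Q'], size=n_samples)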

    return df

In [2]:
# Load the data
data = generate_synthetic_data()

In [3]:
# Split data into training and testing
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)
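
Because the classes are imbalanced (roughly 80/20), a plain random split can leave the two partitions with slightly different positive rates. A minimal sketch, assuming a stratified split is acceptable here (the split above is not stratified), passes the target column to the `stratify` argument of `train_test_split`. The `toy` frame is a hypothetical stand-in for `data`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame with an 80/20 class imbalance, standing in for `data`.
toy = pd.DataFrame({
    "Feature_1": np.arange(100, dtype=float),
    "Target": [0] * 80 + [1] * 20,
})

# stratify keeps the class proportions the same in both partitions.
toy_train, toy_test = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["Target"]
)

# Compare the share of positive samples in each partition.
train_rate = toy_train["Target"].mean()
test_rate = toy_test["Target"].mean()
print(f"train positive rate: {train_rate:.2f}, test positive rate: {test_rate:.2f}")
```

For the notebook's own data, the equivalent change would be adding `stratify=data['Target']` to the existing call.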

## Data Processing

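Raw data often contains missing values, inconsistencies, and categorical variables that must be converted to a numerical format. This step prepares the data for model training. The synthetic dataset has no missing values, so the imputation step is included only to mirror a realistic pipeline.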
- Handle Missing Values: Impute missing values using the mean for numerical features and the mode for categorical features.
- Encode Categorical Features: Use one-hot encoding to convert categorical variables into numerical representations.
- Scale Numerical Features: Use StandardScaler to standardize numerical features.
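
The three steps above can be sketched in isolation on a tiny frame before they are combined into a pipeline. This is an illustration only: the `toy` frame and its column names are hypothetical, and each transformer is applied by hand rather than through a `ColumnTransformer`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({
    "num": [1.0, np.nan, 3.0, 2.0],
    "cat": pd.Series(["A", "B", np.nan, "A"], dtype=object),
})

# 1. Impute: mean for the numerical column, mode for the categorical one.
num_imputed = SimpleImputer(strategy="mean").fit_transform(toy[["num"]])
cat_imputed = SimpleImputer(strategy="most_frequent").fit_transform(toy[["cat"]])

# 2. One-hot encode the imputed categories. The encoder returns a sparse
#    matrix by default, so convert it to a dense array for display.
encoder = OneHotEncoder(handle_unknown="ignore")
cat_encoded = encoder.fit_transform(cat_imputed).toarray()

# 3. Standardize the imputed numerical column to zero mean and unit variance.
num_scaled = StandardScaler().fit_transform(num_imputed)

print(num_imputed.ravel())
print(cat_imputed.ravel())
print(cat_encoded)
print(num_scaled.ravel())
```

In a real workflow the imputers, encoder, and scaler are fitted on the training data only and then reused on the test data, which is what the pipeline in the next cell does.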

In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

def preprocess_data(train_data, test_data):
    """
    Preprocesses the training and testing data.

    Args:
        train_data: Pandas DataFrame containing the training data.
        test_data: Pandas DataFrame containing the testing data.

    Returns:
        Tuple of preprocessed training and testing data (X_train, X_test, y_train, y_test).
    """
    # Separate features and target variable
    X_train = train_data.drop('Target', axis=1)
    y_train = train_data['Target']
    X_test = test_data.drop('Target', axis=1)
    y_test = test_data['Target']

    # Identify numerical and categorical features
    numerical_features = X_train.select_dtypes(include=np.number).columns
    categorical_features = X_train.select_dtypes(include='object').columns

    # Create transformers for preprocessing
    numerical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler())
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Use ColumnTransformer to apply transformers to the correct columns
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_features),
            ('cat', categorical_transformer, categorical_features)
        ])

    # Fit the preprocessor on the training data only, then transform it.
    # Keep the original index so the features stay aligned with y_train.
    X_train_processed = preprocessor.fit_transform(X_train)
    X_train_processed = pd.DataFrame(X_train_processed, index=X_train.index)

    # Transform the testing data with the preprocessor fitted on the training data
    X_test_processed = preprocessor.transform(X_test)
    X_test_processed = pd.DataFrame(X_test_processed, index=X_test.index)

    return X_train_processed, X_test_processed, y_train, y_test

In [7]:
# Preprocess the data
X_train, X_test, y_train, y_test = preprocess_data(train_data, test_data)

## Model Selection and Training with AutoML

Use auto-sklearn to search automatically for a well-performing machine learning model and tune its hyperparameters within a fixed time budget. AutoML simplifies model selection and hyperparameter tuning, which can be time-consuming and can require extensive expertise.
- Initialize an Auto-sklearn classifier.
- Set a time limit for the search process.
- Fit the Auto-sklearn instance to the training data.


In [None]:
import autosklearn.classification
import sklearn.model_selection
import sklearn.metrics
import time

def train_automl_model(X_train, y_train, time_limit=60):
    """
    Trains an AutoML model using auto-sklearn.

    Args:
        X_train: Preprocessed training features.
        y_train: Training target variable.
        time_limit: The time limit for the AutoML search in seconds.

    Returns:
        The trained Auto-sklearn model.
    """
    # Create an Auto-sklearn classification object
    automl = autosklearn.classification.AutoSklearnClassifier(
        time_left_for_this_task=time_limit,  # Time limit in seconds
        per_run_time_limit=15,            # Time limit for each individual run
        memory_limit=4096,               # Memory limit in MB
        n_jobs=-1,                       # Use all available CPU cores
        # Resampling strategy: 3-fold cross-validation
        resampling_strategy="cv",
        resampling_strategy_arguments={'folds': 3},

        # Uncomment to keep auto-sklearn's temporary files and logs
        # delete_tmp_folder_after_terminate=False,
        # tmp_folder="tmp",
    )

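    # Fit the AutoML model to the training data
    automl.fit(X_train, y_train, dataset_name='synthetic_data')

    # With resampling_strategy="cv", models are fitted on individual folds
    # during the search; refit() retrains the final ensemble on all of the
    # training data before it is used for prediction.
    automl.refit(X_train.copy(), y_train.copy())

    return automl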

# Train the AutoML model
automl_model = train_automl_model(X_train, y_train, time_limit=120) # Increased time limit for better results
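
The evaluation cell calls `evaluate_model`, which is not defined anywhere in this section. A minimal sketch of what such a helper might look like follows. The function body, the choice of metrics, and the dictionary return type are assumptions, not the original implementation.

```python
from sklearn.metrics import (
    accuracy_score,
    f1_score,
    precision_score,
    recall_score,
    roc_auc_score,
)


def evaluate_model(model, X_test, y_test):
    """
    Hypothetical helper: scores a fitted binary classifier on held-out data.

    Returns:
        A dictionary mapping metric names to values.
    """
    y_pred = model.predict(X_test)

    # Accuracy alone is misleading on imbalanced classes, so also report
    # precision, recall, and F1 for the positive class.
    metrics = {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred, zero_division=0),
        "recall": recall_score(y_test, y_pred, zero_division=0),
        "f1": f1_score(y_test, y_pred, zero_division=0),
    }

    # ROC AUC needs class probabilities, which not every model exposes.
    if hasattr(model, "predict_proba"):
        y_score = model.predict_proba(X_test)[:, 1]
        metrics["roc_auc"] = roc_auc_score(y_test, y_score)

    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")

    return metrics
```

Any object with a scikit-learn style `predict` method works here, so the helper can be checked against a simple baseline classifier before it is used on the AutoML model.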

In [None]:
# Evaluate the model
evaluation_metrics = evaluate_model(automl_model, X_test, y_test)

## Model Deployment with Streamlit

To make the model accessible to users, we'll deploy it as a web application using Streamlit. This allows users to input data and get predictions without needing to write any code.

The application code is in `automl_app.py`.