Exercise Case Study Notebook: Tabular Models

2. Data Loading:

In [None]:
import requests

# URLs of the files
train_data_url = 'https://www.raphaelcousin.com/modules/module4/course/module5_course_handling_duplicate_train.csv'
test_data_url = 'https://www.raphaelcousin.com/modules/module4/course/module5_course_handling_duplicate_test.csv'

# Function to download a file
def download_file(url, file_name):
    response = requests.get(url, timeout=30)  # Time out instead of hanging on a stalled connection
    response.raise_for_status()  # Ensure we notice bad responses
    with open(file_name, 'wb') as file:
        file.write(response.content)
    print(f'Downloaded {file_name} from {url}')

# Downloading the files
download_file(train_data_url, 'module5_course_handling_duplicate_train.csv')
download_file(test_data_url, 'module5_course_handling_duplicate_test.csv')

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [None]:
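# Load the dataset. Note: this file is not fetched by the download cell above,
# so it must already be present in the working directory.
df = pd.read_csv("tabular_exercise_dataset.csv")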

# Split the data
X = df.drop('target', axis=1)
y = df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

3. Tabular Modeling Tasks:

a. Model Selection and Cross-Validation:
   - Task: Implement k-fold cross-validation for a simple model (e.g., logistic regression)
   - Question: How does the choice of k affect the estimated score and its variability across folds? Experiment with different values.
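
As a starting point, here is a minimal sketch of k-fold cross-validation for several values of k. It assumes synthetic data from `make_classification` as a stand-in for `X_train` and `y_train`; the names `X_demo`, `y_demo`, and `cv_results` are illustrative and not part of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling inside the pipeline keeps each fold's statistics out of its own validation split.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

cv_results = {}
for k in (3, 5, 10):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    cv_results[k] = (scores.mean(), scores.std())
    print(f"k={k:2d}: mean accuracy={scores.mean():.3f}, std={scores.std():.3f}")
```

Larger k means more training data per fold but more fits and smaller validation folds, so it is worth looking at both the mean and the spread.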

b. Hyperparameter Optimization:
   - Task: Use RandomizedSearchCV to optimize hyperparameters for a Random Forest model
   - Question: Compare the performance of RandomizedSearchCV with a simple grid search. What are the trade-offs?
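
One possible setup for this comparison is sketched below. The search ranges, the small `n_iter`, and the synthetic data are assumptions chosen to keep the example fast; they are not recommended values for the exercise dataset.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Synthetic stand-in data (assumption): swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)

# Randomized search samples a fixed number of candidates from distributions.
param_distributions = {
    "n_estimators": randint(10, 50),
    "max_depth": randint(2, 8),
    "min_samples_leaf": randint(1, 5),
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,  # number of sampled candidates
    cv=3,
    random_state=0,
)
random_search.fit(X_demo, y_demo)

# Grid search evaluates every combination of the listed values.
param_grid = {"n_estimators": [10, 30], "max_depth": [2, 5]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid=param_grid, cv=3)
grid_search.fit(X_demo, y_demo)

print("Randomized:", random_search.best_params_, round(random_search.best_score_, 3))
print("Grid:      ", grid_search.best_params_, round(grid_search.best_score_, 3))
```

The cost of the randomized search is fixed by `n_iter`, while the grid's cost grows multiplicatively with each added parameter value.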

c. Linear Models:
   - Task: Implement and compare Lasso, Ridge, and Elastic Net regression
   - Question: Which regularization technique performs best for this dataset? Why?
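
A minimal sketch of the comparison follows. It assumes a continuous target and uses synthetic data from `make_regression`; the `alpha` and `l1_ratio` values are placeholders that would normally be tuned.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): only 5 of the 20 features carry signal.
X_demo, y_demo = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

linear_models = {
    "Lasso": Lasso(alpha=1.0),                          # L1 penalty
    "Ridge": Ridge(alpha=1.0),                          # L2 penalty
    "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),  # mix of L1 and L2
}

linear_scores = {}
for name, estimator in linear_models.items():
    # Regularized models are scale-sensitive, so standardize the features first.
    pipeline = make_pipeline(StandardScaler(), estimator)
    scores = cross_val_score(pipeline, X_demo, y_demo, cv=5, scoring="r2")
    linear_scores[name] = scores.mean()
    print(f"{name:10s} mean R^2 = {scores.mean():.3f}")
```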

d. Tree-Based Models:
   - Task: Implement a Gradient Boosting model (e.g., XGBoost or LightGBM)
   - Question: Compare the performance and training time of Gradient Boosting with Random Forest. Discuss the results.
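
The sketch below shows one way to record both accuracy and fit time. It deliberately uses scikit-learn's `GradientBoostingClassifier` instead of XGBoost or LightGBM so that it runs without extra dependencies; the data and hyperparameters are assumptions for illustration.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): swap in the notebook's own split.
X_demo, y_demo = make_classification(n_samples=500, n_features=15, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

tree_models = {
    "GradientBoosting": GradientBoostingClassifier(n_estimators=50, random_state=0),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}

tree_results = {}
for name, estimator in tree_models.items():
    start = time.perf_counter()
    estimator.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    accuracy = estimator.score(X_te, y_te)
    tree_results[name] = {"accuracy": accuracy, "fit_seconds": fit_seconds}
    print(f"{name:17s} accuracy={accuracy:.3f}  fit time={fit_seconds:.3f}s")
```

Timings from a single run are noisy, so repeating the measurement a few times gives a fairer comparison.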

e. Support Vector Machines:
   - Task: Implement an SVM with different kernel functions
   - Question: How does the choice of kernel affect the model's performance and computational efficiency?
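
A possible starting point for the kernel comparison is sketched here, using `SVC` with its default settings for each kernel. The synthetic data and the names `svm_results`, `X_tr`, and `X_te` are assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): swap in the notebook's own split.
X_demo, y_demo = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

svm_results = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # SVMs rely on distances and inner products, so features are standardized first.
    svm = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    start = time.perf_counter()
    svm.fit(X_tr, y_tr)
    fit_seconds = time.perf_counter() - start
    svm_results[kernel] = (svm.score(X_te, y_te), fit_seconds)
    print(f"{kernel:8s} accuracy={svm_results[kernel][0]:.3f}  fit time={fit_seconds:.3f}s")
```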

f. K-Nearest Neighbors:
   - Task: Implement KNN and experiment with different values of k
   - Question: How does the value of k impact the model's bias-variance trade-off?
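
Comparing training and validation accuracy for several values of k is one way to see the trade-off. The sketch below assumes synthetic data, and the list of k values is arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption): swap in the notebook's own split.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

knn_scores = {}
for k in (1, 5, 15, 51):
    knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    knn.fit(X_tr, y_tr)
    # Store (training accuracy, validation accuracy) for each k.
    knn_scores[k] = (knn.score(X_tr, y_tr), knn.score(X_val, y_val))
    print(f"k={k:2d}: train={knn_scores[k][0]:.3f}  validation={knn_scores[k][1]:.3f}")
```

With k=1 every training point is its own nearest neighbor, so training accuracy is perfect; the gap to validation accuracy is what indicates variance.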

g. Naive Bayes:
   - Task: Apply Gaussian Naive Bayes to the dataset
   - Question: In what scenarios might Naive Bayes outperform more complex models?
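
Applying Gaussian Naive Bayes takes only a few lines, as this sketch on synthetic stand-in data (an assumption) suggests. No scaling step is included because the model fits a per-class mean and variance for each feature.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (assumption): swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

nb_scores = cross_val_score(GaussianNB(), X_demo, y_demo, cv=5, scoring="accuracy")
print(f"GaussianNB mean accuracy = {nb_scores.mean():.3f} (std {nb_scores.std():.3f})")
```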

h. Ensemble Techniques:
   - Task: Create a voting classifier using three different base models
   - Question: How does the ensemble's performance compare to that of the individual models? Explain why combining them helps or fails to help.
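
One way to build such an ensemble is sketched below with soft voting over three deliberately different base models. The choice of base models and the synthetic data are assumptions; any three classifiers that expose `predict_proba` would work with `voting="soft"`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

base_models = [
    ("logreg", make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))),
    ("tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("nb", GaussianNB()),
]
# Soft voting averages the predicted class probabilities of the base models.
ensemble = VotingClassifier(estimators=base_models, voting="soft")

voting_scores = {}
for name, estimator in base_models + [("voting", ensemble)]:
    scores = cross_val_score(estimator, X_demo, y_demo, cv=5, scoring="accuracy")
    voting_scores[name] = scores.mean()
    print(f"{name:7s} mean accuracy = {scores.mean():.3f}")
```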

i. Time Series Consideration:
   - Question: If this dataset contains a time component, how would you modify your modeling approach?
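
One common adjustment, if the rows are ordered in time, is to replace shuffled k-fold with forward-chaining splits so that each validation fold comes strictly after its training data. The sketch below illustrates this with scikit-learn's `TimeSeriesSplit` on a toy array of 20 ordered rows (an assumption, not the exercise data).

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-in (assumption): 20 rows already sorted from oldest to newest.
X_ordered = np.arange(20).reshape(-1, 1)

tscv = TimeSeriesSplit(n_splits=4)
splits = list(tscv.split(X_ordered))

for fold, (train_idx, test_idx) in enumerate(splits, start=1):
    # Training indices always precede validation indices, so no future data leaks in.
    print(f"fold {fold}: train rows 0..{train_idx.max()}, "
          f"validate rows {test_idx.min()}..{test_idx.max()}")
```

The `tscv` object can be passed as the `cv` argument to `cross_val_score` or to the search classes in place of an integer.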

j. AutoML:
   - Task: Use an AutoML library (e.g., auto-sklearn or TPOT) on the dataset
   - Question: Compare the AutoML results with your manually tuned models. Discuss the pros and cons of using AutoML in this scenario.

4. Model Evaluation and Comparison:
   - Task: Implement a function to compare all models using appropriate metrics
   - Question: Based on your results, which model would you choose for deployment? Justify your answer.
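
A comparison helper could look like the following sketch. The function name `compare_models`, its parameters, and the default metrics are hypothetical choices for illustration; the metrics should be adapted to the actual target (for example, regression metrics for a continuous target).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.naive_bayes import GaussianNB


def compare_models(models, X, y, cv=5, scoring=("accuracy", "f1_macro")):
    """Cross-validate each model and return one row of mean metrics per model."""
    rows = []
    for name, model in models.items():
        result = cross_validate(model, X, y, cv=cv, scoring=list(scoring))
        row = {"model": name, "fit_time_s": result["fit_time"].mean()}
        for metric in scoring:
            row[metric] = result[f"test_{metric}"].mean()
        rows.append(row)
    # Sort by the first metric so the strongest candidate appears first.
    return pd.DataFrame(rows).set_index("model").sort_values(scoring[0], ascending=False)


# Synthetic stand-in data (assumption): swap in X_train, y_train from the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=0)

candidates = {
    "logreg": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "naive_bayes": GaussianNB(),
}
comparison = compare_models(candidates, X_demo, y_demo)
print(comparison)
```

Including fit time alongside the scores makes it easier to justify a deployment choice on more than accuracy alone.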

5. Submission:

In [None]:
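# Load the final test set, using the 'id' column as the index
X_test_final = pd.read_csv("module6_exercise_test.csv", sep=",", index_col='id')

# best_model must be a fitted estimator (e.g., the one selected in section 4)
submission = pd.DataFrame({
    'id': X_test_final.index,
    'target': best_model.predict(X_test_final)
})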

submission.to_csv('submission.csv', index=False, sep=',')

6. Final Questions:
   - Summarize the key findings from your model comparison.
   - How might feature engineering or selection improve these models?
   - Discuss the potential impact of class imbalance or data leakage on your results.
   - What additional steps would you take before deploying the chosen model in a real-world scenario?