# AutoGluon Tutorial: Automated Machine Learning with Real Datasets

## Overview
AutoGluon is a powerful AutoML library that enables you to build state-of-the-art machine learning models with just a few lines of code. This tutorial demonstrates:
- 🚀 Quick setup and installation
- 📊 Working with real datasets (including Kaggle)
- 🤖 Training multiple models automatically
- 📈 Model evaluation and comparison
- 🎯 Making predictions

**Author**: AutoGluon Tutorial  
**Dataset**: Titanic (Kaggle)  
**Task**: Binary Classification

## 1. Install and Import AutoGluon

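First, we'll install AutoGluon and import the necessary libraries. AutoGluon ships separate modules for different data types, such as `autogluon.tabular`, `autogluon.multimodal` (text and images), and `autogluon.timeseries`. This tutorial uses only the tabular module.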

In [None]:
# Install AutoGluon (uncomment if not already installed)
# !pip install autogluon

# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from autogluon.tabular import TabularPredictor
import warnings
warnings.filterwarnings('ignore')

from importlib.metadata import version  # standard library; reads the installed package version

print("✓ Libraries imported successfully!")
print(f"AutoGluon version: {version('autogluon.tabular')}")

## 2. Load the Titanic Dataset

We'll use the famous **Titanic dataset** for this tutorial. This dataset contains information about Titanic passengers and whether they survived.

### Option A: Load from Seaborn (Built-in)
For quick start, we'll use seaborn's built-in Titanic dataset.

### Option B: Load from Kaggle API
To use Kaggle datasets:
1. Get Kaggle API token from kaggle.com/account
2. Place kaggle.json in ~/.kaggle/
3. Use: `!kaggle datasets download -d kaggle/titanic`

In [None]:
# Option A: Load from seaborn (built-in dataset)
df = sns.load_dataset('titanic')

print("✓ Dataset loaded successfully!")
print(f"\nDataset shape: {df.shape}")
print(f"Number of rows: {df.shape[0]}")
print(f"Number of columns: {df.shape[1]}")

In [None]:
# Optional: Load from Kaggle (uncomment if you have Kaggle API configured)
# !kaggle datasets download -d kaggle/titanic -p ./data --unzip
# df = pd.read_csv('./data/train.csv')
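
One practical difference between the two options: Kaggle's `train.csv` uses capitalized column names (`Survived`, `Pclass`, `SibSp`, ...), while the seaborn copy uses lowercase names, and the rest of this notebook indexes columns in lowercase. A minimal sketch of aligning the two, using a tiny stand-in frame with made-up rows and a hypothetical helper named `normalize_columns`:

```python
import pandas as pd

# Tiny stand-in for the header of Kaggle's train.csv (illustrative rows, not real passenger data)
kaggle_like = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Sex": ["male", "female"],
    "Age": [30.0, 40.0],
    "SibSp": [0, 1],
    "Parch": [0, 0],
    "Fare": [8.0, 70.0],
    "Embarked": ["S", "C"],
})

def normalize_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with lowercase column names (hypothetical helper)."""
    return frame.rename(columns=str.lower)

df_kaggle = normalize_columns(kaggle_like)
print(list(df_kaggle.columns))
```

After this step, selections such as `df['survived']` work the same way for either data source.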

## 3. Explore the Dataset

Let's explore the dataset to understand its structure, features, and data quality.

In [None]:
# Display first few rows
print("First 5 rows of the dataset:")
df.head()

In [None]:
# Dataset information
print("Dataset Information:")
df.info()  # prints the summary itself and returns None, so no print() wrapper is needed
print("\n" + "="*60)
print("\nColumn Data Types:")
print(df.dtypes)

In [None]:
# Check for missing values
print("Missing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df)) * 100
missing_df = pd.DataFrame({
    'Missing Count': missing,
    'Percentage': missing_pct
})
print(missing_df[missing_df['Missing Count'] > 0].sort_values('Missing Count', ascending=False))

In [None]:
# Statistical summary
print("Statistical Summary:")
df.describe()

In [None]:
# Target variable distribution
print("Target Variable (Survived) Distribution:")
print(df['survived'].value_counts())
print(f"\nSurvival Rate: {df['survived'].mean():.2%}")

# Visualize target distribution
plt.figure(figsize=(10, 4))

plt.subplot(1, 2, 1)
df['survived'].value_counts().plot(kind='bar', color=['#e74c3c', '#2ecc71'])
plt.title('Survival Count')
plt.xlabel('Survived (0=No, 1=Yes)')
plt.ylabel('Count')
plt.xticks(rotation=0)

plt.subplot(1, 2, 2)
df['survived'].value_counts().plot(kind='pie', autopct='%1.1f%%', colors=['#e74c3c', '#2ecc71'])
plt.title('Survival Distribution')
plt.ylabel('')

plt.tight_layout()
plt.show()

## 4. Prepare Data for Training

We'll select relevant features and split the data into training and testing sets.

In [None]:
# Select relevant features for modeling
features = ['pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'survived']
df_model = df[features].copy()

# Drop rows with missing target variable
df_model = df_model.dropna(subset=['survived'])

print(f"Dataset shape after preprocessing: {df_model.shape}")
print(f"\nSelected features: {[col for col in df_model.columns if col != 'survived']}")
print(f"Target variable: survived")

# Display a sample
df_model.head()

In [None]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(
    df_model, 
    test_size=0.2, 
    random_state=42,
    stratify=df_model['survived']  # Maintain class balance
)

print("✓ Data split successfully!")
print(f"\nTraining set size: {len(train_data)} samples")
print(f"Test set size: {len(test_data)} samples")
print(f"\nTraining set survival rate: {train_data['survived'].mean():.2%}")
print(f"Test set survival rate: {test_data['survived'].mean():.2%}")
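
To see what "maintain class balance" means, here is a hand-rolled stratified split on synthetic labels. It is an illustration of the idea using pandas only, not scikit-learn's actual implementation, and the toy frame and variable names are assumptions made for this sketch:

```python
import pandas as pd

# Synthetic labels: 40% positives, 60% negatives
toy = pd.DataFrame({"x": range(100), "y": [1] * 40 + [0] * 60})

# Sample 20% of each class for the test set, then keep the remaining rows for training
toy_test = toy.groupby("y", group_keys=False).sample(frac=0.2, random_state=42)
toy_train = toy.drop(toy_test.index)

# Both splits keep the same positive rate as the full data
print(f"Train positive rate: {toy_train['y'].mean():.2f}")
print(f"Test positive rate:  {toy_test['y'].mean():.2f}")
```

Because sampling happens inside each class, the positive rate is preserved in both splits, which is the property `stratify=` provides in `train_test_split`.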

## 5. Train AutoGluon Model

Now comes the magic! With default settings, AutoGluon will automatically:
- Handle missing values
- Encode categorical variables
- Train multiple models (Random Forest, XGBoost, Neural Networks, etc.)
- Apply preset hyperparameters to each model (hyperparameter search is opt-in via `hyperparameter_tune_kwargs`)
- Create ensemble models
- Select the best model

**All with just 3 lines of code!** 🎉

In [None]:
# Initialize the predictor
predictor = TabularPredictor(
    label='survived',           # Target column
    problem_type='binary',      # Can be 'binary', 'multiclass', 'regression'
    eval_metric='accuracy',     # Metric to optimize
    path='./ag_models/titanic'  # Where to save models
)

print("✓ Predictor initialized!")
print(f"\nProblem Type: {predictor.problem_type}")
print(f"Evaluation Metric: accuracy")

In [None]:
# Train the model
# This will train multiple models and create ensembles
# Training time can be adjusted based on your needs

predictor.fit(
    train_data=train_data,
    time_limit=120,              # Time limit in seconds (2 minutes)
    presets='medium_quality',    # Options: 'best_quality', 'high_quality', 'good_quality', 'medium_quality'
    verbosity=2                  # 0=silent, 1=minimal, 2=normal, 3=detailed
)

print("\n" + "="*60)
print("✓ Training completed successfully!")
print("="*60)
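
Because the predictor was given a `path`, the fitted models are written to disk and can be reloaded in a later session without retraining. A minimal sketch, assuming the training cell above has already run and saved to `./ag_models/titanic`; the wrapper name `load_titanic_predictor` is hypothetical, while `TabularPredictor.load` is the library call it relies on:

```python
def load_titanic_predictor(path="./ag_models/titanic"):
    """Reload a previously fitted predictor from disk (hypothetical helper)."""
    # Imported lazily so that defining this helper does not itself require AutoGluon
    from autogluon.tabular import TabularPredictor
    return TabularPredictor.load(path)

# Usage in a later session (assumes the models directory exists):
# predictor = load_titanic_predictor()
# predictor.predict(test_data.drop(columns=["survived"]))
```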

### Understanding Training Parameters

- **time_limit**: Total time budget for training (in seconds). A larger budget generally lets AutoGluon train more models and build stronger ensembles, with diminishing returns
- **presets**: Quality/speed trade-off
  - `best_quality`: Highest accuracy, slowest (competition setting)
  - `high_quality`: High accuracy, slower
  - `good_quality`: Good accuracy, moderate speed
  - `medium_quality`: Decent accuracy, faster (prototyping)
- **verbosity**: Amount of output information

## 6. Evaluate Model Performance

Let's evaluate how well our model performs on the test set.

In [None]:
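# Evaluate on test data
# evaluate() returns a dict of metrics, so pull out the accuracy entry before formatting
performance = predictor.evaluate(test_data, silent=False)
test_accuracy = performance['accuracy']

print(f"\n{'='*60}")
print(f"Test Accuracy: {test_accuracy:.4f} ({test_accuracy*100:.2f}%)")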
print(f"{'='*60}")
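
An accuracy figure is easier to judge next to a trivial baseline. The sketch below computes the accuracy of always predicting the most frequent class; the helper name and the toy labels are assumptions for illustration, and in this notebook you would pass `test_data['survived']` instead:

```python
import pandas as pd

def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return float(labels.value_counts(normalize=True).max())

# Toy labels for illustration: three negatives, two positives
toy_labels = pd.Series([0, 0, 0, 1, 1])
baseline = majority_baseline_accuracy(toy_labels)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

A model is only adding value to the extent that its test accuracy exceeds this baseline.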

In [None]:
# Model Leaderboard - Compare all trained models
leaderboard = predictor.leaderboard(test_data, silent=True)

print("\n📊 Model Leaderboard (All Trained Models):")
print("="*80)
print(leaderboard)
print("\n💡 The model at the top performed best on the test set!")

In [None]:
# Visualize model comparison
plt.figure(figsize=(12, 6))

# Get top 10 models
top_models = leaderboard.head(10)

plt.barh(range(len(top_models)), top_models['score_test'], color='skyblue')
plt.yticks(range(len(top_models)), top_models['model'])
plt.xlabel('Test Score (Accuracy)')
plt.title('Top 10 Model Performance Comparison')
plt.gca().invert_yaxis()
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

In [None]:
# Detailed performance metrics
from sklearn.metrics import classification_report, confusion_matrix

# Get predictions
y_pred = predictor.predict(test_data.drop(columns=['survived']))
y_true = test_data['survived']

# Classification report
print("\n📈 Detailed Classification Report:")
print("="*60)
print(classification_report(y_true, y_pred, target_names=['Did not survive', 'Survived']))

# Confusion matrix
cm = confusion_matrix(y_true, y_pred)

plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Did not survive', 'Survived'],
            yticklabels=['Did not survive', 'Survived'])
plt.title('Confusion Matrix')
plt.ylabel('True Label')
plt.xlabel('Predicted Label')
plt.tight_layout()
plt.show()

print("\n💡 Confusion Matrix Interpretation:")
print(f"   True Negatives (TN): {cm[0,0]} - Correctly predicted did not survive")
print(f"   False Positives (FP): {cm[0,1]} - Incorrectly predicted survived")
print(f"   False Negatives (FN): {cm[1,0]} - Incorrectly predicted did not survive")
print(f"   True Positives (TP): {cm[1,1]} - Correctly predicted survived")
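
The four confusion-matrix counts are enough to recompute the headline metrics in the classification report. A small sketch with hypothetical counts (not results from this notebook); with the real matrix you could unpack them via `tn, fp, fn, tp = cm.ravel()`:

```python
def summarize_confusion(tn: int, fp: int, fn: int, tp: int) -> dict:
    """Derive accuracy, precision, recall, and F1 for the positive class."""
    total = tn + fp + fn + tp
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return {
        "accuracy": (tn + tp) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow
metrics = summarize_confusion(tn=50, fp=10, fn=15, tp=25)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Precision answers "of the passengers predicted to survive, how many did?", while recall answers "of the passengers who survived, how many did the model find?".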

## 7. Make Predictions

Now let's use our trained model to make predictions on new data.

In [None]:
# Create sample passengers for prediction
sample_passengers = pd.DataFrame({
    'pclass': [3, 1, 2, 3, 1],
    'sex': ['male', 'female', 'male', 'female', 'male'],
    'age': [22, 38, 26, 35, 54],
    'sibsp': [1, 1, 0, 0, 0],
    'parch': [0, 0, 0, 2, 1],
    'fare': [7.25, 71.28, 13.0, 20.5, 51.86],
    'embarked': ['S', 'C', 'S', 'S', 'S']
})

print("Sample Passengers for Prediction:")
print(sample_passengers)

In [None]:
# Make predictions
predictions = predictor.predict(sample_passengers)
probabilities = predictor.predict_proba(sample_passengers)

print("\n🎯 Predictions:")
print("="*80)

for i in range(len(sample_passengers)):
    passenger = sample_passengers.iloc[i]
    pred = predictions.iloc[i]
    prob = probabilities.iloc[i]
    
    print(f"\nPassenger {i+1}:")
    print(f"  Class: {passenger['pclass']}, Sex: {passenger['sex']}, Age: {passenger['age']}")
    print(f"  Fare: ${passenger['fare']:.2f}")
    print(f"  🔮 Prediction: {'✓ SURVIVED' if pred == 1 else '✗ DID NOT SURVIVE'}")
    print(f"  📊 Confidence: {max(prob):.2%}")
    print(f"  📈 Probabilities: Did not survive: {prob[0]:.2%}, Survived: {prob[1]:.2%}")
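
The probability table can also be turned into labels by hand, which is useful if you want a cutoff other than the default. The sketch below assumes a table with one column per class (`0` and `1`), matching how the loop above indexes `prob`; the numbers are made up for illustration:

```python
import pandas as pd

# Hypothetical class probabilities for three passengers (columns are the class labels)
toy_proba = pd.DataFrame({0: [0.80, 0.30, 0.55], 1: [0.20, 0.70, 0.45]})

# Option 1: pick the most probable class for each row
labels_argmax = toy_proba.idxmax(axis=1)

# Option 2: apply a custom threshold to the positive-class probability
threshold = 0.4
labels_custom = (toy_proba[1] >= threshold).astype(int)

print(labels_argmax.tolist())
print(labels_custom.tolist())
```

Lowering the threshold flags more passengers as survivors, trading precision for recall. The threshold value here is arbitrary.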

In [None]:
# Visualize predictions
fig, ax = plt.subplots(figsize=(12, 6))

passenger_labels = [f"P{i+1}\n{row['sex'][0].upper()}, {row['age']}y\nClass {row['pclass']}" 
                   for i, row in sample_passengers.iterrows()]

# Plot survival probabilities
x = np.arange(len(sample_passengers))
width = 0.35

bars1 = ax.bar(x - width/2, probabilities.iloc[:, 0], width, label='Did Not Survive', color='#e74c3c')
bars2 = ax.bar(x + width/2, probabilities.iloc[:, 1], width, label='Survived', color='#2ecc71')

ax.set_xlabel('Passenger')
ax.set_ylabel('Probability')
ax.set_title('Survival Prediction Probabilities for Sample Passengers')
ax.set_xticks(x)
ax.set_xticklabels(passenger_labels)
ax.legend()
ax.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

## 8. Feature Importance Analysis

Understanding which features are most important for predictions helps us:
- Gain insights into the problem
- Identify key factors
- Improve data collection
- Build trust in the model

In [None]:
# Get feature importance
feature_importance = predictor.feature_importance(test_data)

print("📊 Feature Importance:")
print("="*60)
print(feature_importance)
print("\n💡 Higher values = more important for predictions")
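
AutoGluon's documentation describes `feature_importance` as permutation-based: a feature's values are shuffled and the resulting drop in score is measured. The sketch below illustrates that idea on synthetic data with a stand-in model. It is not AutoGluon's implementation, and all names in it are assumptions made for the example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: the label depends only on feature 0; feature 1 is pure noise
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

def toy_model(features: np.ndarray) -> np.ndarray:
    """Stand-in for a fitted model: thresholds feature 0 and ignores feature 1."""
    return (features[:, 0] > 0).astype(int)

def permutation_importance_sketch(model, X, y, n_repeats=5):
    """Mean drop in accuracy after shuffling each column in turn."""
    baseline = np.mean(model(X) == y)
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])  # break the feature-label link
            drops.append(baseline - np.mean(model(X_perm) == y))
        importances.append(float(np.mean(drops)))
    return importances

scores = permutation_importance_sketch(toy_model, X, y)
print(scores)
```

Shuffling the informative feature costs roughly half the accuracy, while shuffling the noise feature costs nothing, which is why scores near zero suggest a feature the model does not rely on.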

In [None]:
# Visualize feature importance
plt.figure(figsize=(10, 6))

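# feature_importance is a DataFrame; plot only its 'importance' column
importance_scores = feature_importance['importance'].sort_values()
colors = plt.cm.viridis(np.linspace(0.3, 0.9, len(importance_scores)))
importance_scores.plot(kind='barh', color=colors)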

plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Features', fontsize=12)
plt.title('Feature Importance for Titanic Survival Prediction', fontsize=14, fontweight='bold')
plt.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()

# Print interpretation
print("\n🔍 Feature Importance Interpretation:")
print("-" * 60)
# idxmax() on the 'importance' column returns the name of the top-ranked feature
top_feature = feature_importance['importance'].idxmax()
print(f"Most important feature: '{top_feature}'")
print(f"\nThis means '{top_feature}' has the strongest influence on survival predictions.")

## 🎓 Summary and Next Steps

### What We Accomplished
✅ Loaded a real dataset (Titanic, via seaborn or Kaggle)  
✅ Explored and preprocessed the data  
✅ Trained multiple ML models automatically  
✅ Evaluated model performance  
✅ Made predictions on new data  
✅ Analyzed feature importance  

### Key Takeaways
1. **AutoGluon is powerful**: Strong results with minimal code
2. **Automatic by default**: Preprocessing, model training, ensembling, and model selection
3. **Multiple models**: Trains and compares many models automatically
4. **Easy to use**: Perfect for beginners and experts alike

### Next Steps to Try

1. **Use Different Datasets**:
   - Download other Kaggle datasets
   - Try regression problems (house prices, stock prediction)
   - Work with larger datasets

2. **Customize Training**:
   ```python
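   # Increase training time for better results
   # (fit() can only be called once per predictor, so create a new one for each run)
   predictor_best = TabularPredictor(label='survived').fit(
       train_data, time_limit=600, presets='best_quality'
   )
   
   # Specify which models to use
   predictor_custom = TabularPredictor(label='survived').fit(train_data, hyperparameters={
       'GBM': {},     # LightGBM
       'XGB': {},     # XGBoost
       'CAT': {},     # CatBoost
       'RF': {}       # Random Forest
   })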
   ```

3. **Advanced Features**:
   - Multi-label classification
   - Time series forecasting
   - Text and image data
   - Custom feature engineering

### Resources
- 📖 [AutoGluon Documentation](https://auto.gluon.ai/)
- 💻 [GitHub Repository](https://github.com/autogluon/autogluon)
- 🎓 [Tutorials](https://auto.gluon.ai/stable/tutorials/index.html)
- 🏆 [Kaggle Datasets](https://www.kaggle.com/datasets)

---

**Happy AutoML! 🚀**