# 📚 Complete Guide to PyCaret and Streamlit for Machine Learning

**Author:** Victor Bolade - Data Science Lead Instructor

**Date:** January 2026

---

## 🎯 What You'll Learn

1. What is PyCaret and why it's powerful
2. What is Streamlit and how to deploy ML models
3. Complete ML workflow with PyCaret
4. Multiple approaches (flexible vs rigid)
5. Handling imbalanced datasets
6. Production-ready code examples

---

# 1️⃣ What is PyCaret?

## 📖 Definition

**PyCaret** is an open-source, **low-code** machine learning library in Python that automates machine learning workflows.

## 🚀 Why Use PyCaret?

### Traditional ML Workflow (Without PyCaret)
```python
# Each step below has to be written and maintained by hand
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Split features and target
X = data.drop('target', axis=1)
y = data['target']

# Handle categorical variables
le = LabelEncoder()
for col in X.select_dtypes(include='object').columns:
    X[col] = le.fit_transform(X[col])

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
```

### Same Workflow with PyCaret (Just a Few Lines!)
```python
from pycaret.classification import *

# Setup (splits the data and builds the preprocessing pipeline:
# imputation and encoding by default, scaling with normalize=True)
clf = setup(data=data, target='target', session_id=42)

# Train and compare the available models automatically
best_model = compare_models()

# Done! ✅
```

## 🎯 Key Features of PyCaret

1. **Automated Data Preprocessing** - Handles missing values, encoding, scaling
2. **Model Comparison** - Tests 15+ algorithms in one line
3. **Hyperparameter Tuning** - Automatic optimization
4. **Model Deployment** - Easy integration with Streamlit, Flask, FastAPI
5. **Imbalance Handling** - Built-in SMOTE and other techniques
6. **Cross-Validation** - Automatic k-fold CV

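To make "automatic k-fold CV" concrete, the sketch below splits row indices into folds by hand. It is an illustration in plain Python, not PyCaret's implementation: `k_fold_indices` is a hypothetical helper, and it does plain, unshuffled k-fold, whereas PyCaret's classification module uses a stratified strategy by default.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of k folds."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)  # spread the remainder
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, validation
        start += size


folds = list(k_fold_indices(n_samples=10, k=5))
for number, (train, validation) in enumerate(folds, start=1):
    print(f"Fold {number}: validate on {validation}, train on {len(train)} rows")
```

Every row is used for validation exactly once, which is why the averaged score is a steadier estimate than a single train/test split.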
## 📦 What Can You Build with PyCaret?

- **Classification** - Customer churn, fraud detection, disease diagnosis
- **Regression** - House price prediction, sales forecasting
- **Clustering** - Customer segmentation
- **Anomaly Detection** - Fraud detection, network intrusion
- **Time Series** - Stock price prediction, demand forecasting
- **NLP** - Sentiment analysis, text classification (PyCaret 2.x only; the `nlp` module was removed in 3.0)

---

# 2️⃣ What is Streamlit?

## 📖 Definition

**Streamlit** is an open-source Python framework that lets you **create interactive web applications** for machine learning and data science projects **without any frontend knowledge** (no HTML, CSS, or JavaScript needed!).

## 🚀 Why Use Streamlit?

### Traditional Web App (Flask/Django)
```python
# Flask - You need to write HTML, CSS, JavaScript
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')  # Need to create HTML file

@app.route('/predict', methods=['POST'])
def predict():
    # Complex form handling
    data = request.form.to_dict()
    # More code...
```

### Same App with Streamlit (Pure Python!)
```python
import streamlit as st

# `model` is assumed to be a trained estimator loaded earlier (e.g. from disk)
st.title("My ML App")
age = st.number_input("Enter age")
if st.button("Predict"):
    prediction = model.predict([[age]])
    st.write(f"Result: {prediction}")
```

## 🎯 Key Features of Streamlit

1. **Pure Python** - No HTML/CSS/JavaScript needed
2. **Interactive Widgets** - Sliders, buttons, file uploaders, text inputs
3. **Data Visualization** - Charts, plots, maps
4. **Fast Deployment** - Deploy to Streamlit Cloud for free
5. **Real-time Updates** - Changes reflect immediately
6. **Session State** - Maintain user data across interactions

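Streamlit reruns the whole script on every interaction, so ordinary variables are reset each time; Session State is what survives the rerun. The sketch below shows the pattern with a click counter. `click_counter` and `FakeStreamlit` are hypothetical names: the function receives the `streamlit` module as an argument only so the example can also run without a browser, and the stand-in class mimics just the three attributes it needs.

```python
def click_counter(st):
    """Render a button and a click count that persists across reruns."""
    if "clicks" not in st.session_state:
        st.session_state["clicks"] = 0  # initialise once per user session
    if st.button("Add one"):
        st.session_state["clicks"] += 1
    st.write(f"Clicks so far: {st.session_state['clicks']}")
    return st.session_state["clicks"]


class FakeStreamlit:
    """Tiny stand-in for the streamlit module so the sketch runs anywhere."""

    def __init__(self):
        self.session_state = {}

    def button(self, label):
        return True  # pretend the user clicked

    def write(self, text):
        print(text)


fake = FakeStreamlit()
first = click_counter(fake)   # simulates the first rerun
second = click_counter(fake)  # simulates the second rerun
```

In a real `app.py` the call would simply be `import streamlit as st` followed by `click_counter(st)`.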
## 🌟 Perfect Combination: PyCaret + Streamlit

**PyCaret** - Build and train your ML model

**Streamlit** - Deploy your model as a web app

**Result** - Production-ready ML application in minutes!

---

# 3️⃣ Installation

## Step 1: Create Conda Environment (Recommended)

```bash
# Create environment
conda create -n pycaret_env python=3.10 -y

# Activate environment
conda activate pycaret_env

# Navigate to your home directory
# Windows (Command Prompt): cd %USERPROFILE%
# Mac/Linux:
cd ~
```

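## Step 2: Fix SSL Issues (For Nigerian Networks)

Use this workaround only if installs fail with SSL errors. Disabling certificate verification is insecure, so switch it back on once the packages are installed.

```bash
conda config --set ssl_verify false
pip install --upgrade pip setuptools wheel
pip config set global.timeout 100
# Re-enable verification once Step 3 has finished
conda config --set ssl_verify true
```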

## Step 3: Install PyCaret and Streamlit

```bash
# Install PyCaret
pip install pycaret

# If it fails, use:
pip install pycaret --no-cache-dir

# Install Streamlit
pip install streamlit
```

## Step 4: Verify Installation

```python
# Test PyCaret
from pycaret.classification import *
print("✅ PyCaret installed successfully!")

# Test Streamlit
import streamlit as st
print("✅ Streamlit installed successfully!")
```

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
from pycaret.classification import *

# Verify imports
print("✅ All libraries imported successfully!")
print(f"Pandas version: {pd.__version__}")
print(f"NumPy version: {np.__version__}")

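Because `setup()` accepts different arguments in PyCaret 2.x and 3.x, it helps to print the installed versions before going further. The sketch below uses only the standard library; `installed_version` is a hypothetical helper name, and the package list is just an example.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


for name in ("pycaret", "streamlit", "pandas"):
    version = installed_version(name)
    print(f"{name}: {version if version else 'not installed'}")
```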
---

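# 4️⃣ Understanding PyCaret Algorithms

## 📊 Classification Algorithms Available in PyCaret

When you run `compare_models()`, PyCaret evaluates the algorithms below. With the default `turbo=True`, the slower ones (`rbfsvm`, `gpc`, `mlp`) are skipped, and `xgboost` and `catboost` are only included if those packages are installed.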

| **Algorithm ID** | **Full Name** | **Best For** |
|------------------|---------------|---------------|
| `lr` | Logistic Regression | Simple, interpretable models |
| `knn` | K-Nearest Neighbors | Pattern recognition |
| `nb` | Naive Bayes | Text classification, spam detection |
| `dt` | Decision Tree | Interpretable, handles non-linear data |
| `svm` | SVM (Linear Kernel) | High-dimensional data |
| `rbfsvm` | SVM with RBF Kernel | Non-linear classification |
| `gpc` | Gaussian Process | Small datasets, uncertainty estimates |
| `mlp` | Multi-Layer Perceptron | Neural networks |
| `ridge` | Ridge Classifier | Regularized linear models |
| `rf` | Random Forest | Most versatile, handles everything |
| `qda` | Quadratic Discriminant | Non-linear decision boundaries |
| `ada` | AdaBoost | Boosting weak learners |
| `gbc` | Gradient Boosting | High accuracy, slower training |
| `lda` | Linear Discriminant | Dimensionality reduction + classification |
| `et` | Extra Trees | Like Random Forest but faster |
| `xgboost` | XGBoost | Industry standard, Kaggle winner |
| `lightgbm` | LightGBM | Fast gradient boosting |
| `catboost` | CatBoost | Handles categorical data well |

## 🎯 How to Know Which Algorithm Was Selected?

```python
# After running compare_models()
best_model = compare_models()

# Print the algorithm name
print(f"Best Algorithm: {type(best_model).__name__}")
print(f"Full Details: {best_model}")
```

---

# 5️⃣ Complete ML Workflow with PyCaret

## 🗂️ Dataset: Customer Churn Prediction

We'll use a synthetic dataset modeled on the **Telco Customer Churn** dataset to predict which customers will leave (churn).

**Business Problem:** A telecommunications company wants to identify customers who are likely to cancel their subscription.

**Target Variable:** `Churn` (Yes/No)

## 📥 Step 1: Load and Explore Data

In [None]:
# Create sample data (you can replace this with your own CSV)
# For demonstration, let's create a sample dataset

# In real scenario, you would load like this:
# data = pd.read_csv("customer_churn.csv")

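# Sample data creation for demo
# Note: every column is drawn independently of 'Churn', so models trained
# on this data should score close to the majority-class baseline
np.random.seed(42)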
data = pd.DataFrame({
    'customerID': [f'CUST{i:04d}' for i in range(1, 1001)],
    'gender': np.random.choice(['Male', 'Female'], 1000),
    'SeniorCitizen': np.random.choice([0, 1], 1000),
    'Partner': np.random.choice(['Yes', 'No'], 1000),
    'Dependents': np.random.choice(['Yes', 'No'], 1000),
    'tenure': np.random.randint(1, 72, 1000),
    'PhoneService': np.random.choice(['Yes', 'No'], 1000),
    'MultipleLines': np.random.choice(['Yes', 'No', 'No phone service'], 1000),
    'InternetService': np.random.choice(['DSL', 'Fiber optic', 'No'], 1000),
    'OnlineSecurity': np.random.choice(['Yes', 'No', 'No internet service'], 1000),
    'Contract': np.random.choice(['Month-to-month', 'One year', 'Two year'], 1000),
    'PaperlessBilling': np.random.choice(['Yes', 'No'], 1000),
    'PaymentMethod': np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer', 'Credit card'], 1000),
    'MonthlyCharges': np.random.uniform(20, 120, 1000),
    'TotalCharges': np.random.uniform(100, 8000, 1000),
    'Churn': np.random.choice(['Yes', 'No'], 1000, p=[0.27, 0.73])  # 27% churn rate (realistic)
})

# Display first few rows
print("📊 Dataset Preview:")
print(data.head())

# Display dataset info (info() prints directly and returns None)
print("\n📈 Dataset Information:")
data.info()

# Display target distribution
print("\n🎯 Target Variable Distribution:")
print(data['Churn'].value_counts())
print("\nChurn Rate (%):")
print(data['Churn'].value_counts(normalize=True) * 100)

## 🧹 Step 2: Data Cleaning (MUST DO BEFORE PYCARET!)

**⚠️ IMPORTANT:** Always clean your data BEFORE using PyCaret.

Common cleaning steps:
1. Remove unnecessary columns (like IDs)
2. Handle missing values
3. Fix data types
4. Remove duplicates

In [None]:
# ===== DATA CLEANING =====

print("🧹 Starting Data Cleaning...\n")

# 1. Remove unnecessary columns
print("1️⃣ Removing customerID column (not predictive)")
data_clean = data.drop(columns=['customerID'])

# 2. Check for missing values
print("\n2️⃣ Checking for missing values:")
print(data_clean.isnull().sum())

# 3. Fix TotalCharges column (common issue in telecom datasets)
print("\n3️⃣ Converting TotalCharges to numeric (if it's not already)")
data_clean['TotalCharges'] = pd.to_numeric(data_clean['TotalCharges'], errors='coerce')

# 4. Check for missing values again (coercion turns bad values into NaN)
print("\n4️⃣ Re-checking for missing values after conversion:")
missing_after = data_clean.isnull().sum()
if missing_after.sum() > 0:
    print(f"\n⚠️ Found {missing_after.sum()} missing values:")
    print(missing_after[missing_after > 0])
    print("\nPyCaret will handle these automatically during setup() ✅")
else:
    print("\n✅ No missing values found!")

# 5. Check for duplicates
duplicates = data_clean.duplicated().sum()
print(f"\n5️⃣ Checking for duplicates: {duplicates} found")
if duplicates > 0:
    data_clean = data_clean.drop_duplicates()
    print("✅ Duplicates removed")

# 6. Verify data types
print("\n6️⃣ Data Types:")
print(data_clean.dtypes)

print("\n✅ Data cleaning complete!")
print(f"Final dataset shape: {data_clean.shape}")

---

# 6️⃣ PyCaret Model Training - TWO APPROACHES

## 🔀 Approach 1: Using `compare_models()` (Automated)

**When to use:**
- You don't know which algorithm works best
- You want to try multiple models quickly
- You want PyCaret to recommend the best algorithm

**Pros:**
- Saves time
- Tests 15+ algorithms automatically
- Finds the best model based on your metric

**Cons:**
- Takes longer to run
- Less control over which model is used

### Setup PyCaret Environment

In [None]:
# ===== APPROACH 1: USING COMPARE_MODELS =====

print("🚀 Setting up PyCaret environment...\n")

# Setup PyCaret (the arguments below are valid for PyCaret 3.x)
clf1 = setup(
    data=data_clean,
    target='Churn',
    session_id=42,  # For reproducibility

    # ===== DATA PREPROCESSING =====
    normalize=True,  # Scale numeric features (z-score by default)
    transformation=False,  # Don't apply power transformations

    # ===== CATEGORICAL ENCODING =====
    categorical_features=['gender', 'Partner', 'Dependents', 'PhoneService',
                          'MultipleLines', 'InternetService', 'OnlineSecurity',
                          'Contract', 'PaperlessBilling', 'PaymentMethod'],

    # ===== NUMERIC FEATURES =====
    numeric_features=['tenure', 'MonthlyCharges', 'TotalCharges'],

    # ===== IMBALANCE HANDLING =====
    # OPTION 1: Handle imbalance (use if target is imbalanced)
    fix_imbalance=True,  # Uses SMOTE by default

    # OPTION 2: Don't handle imbalance (use if target is balanced)
    # fix_imbalance=False,

    # ===== OUTLIER REMOVAL =====
    remove_outliers=False,  # Set to True if you want to remove outliers

    # ===== MULTICOLLINEARITY =====
    remove_multicollinearity=True,  # Remove highly correlated features
    multicollinearity_threshold=0.9,

    # ===== UNKNOWN CATEGORIES / SILENT MODE =====
    # PyCaret 2.x accepted handle_unknown_categorical=True and silent=True here.
    # Both arguments were removed in 3.x, where passing them raises a TypeError.

    # ===== VERBOSITY =====
    verbose=False  # Don't print the setup summary (set to True to see details)
)

print("\n✅ PyCaret setup complete!")

### Compare All Models

In [None]:
# Compare all available models
print("\n🔍 Comparing all classification algorithms...\n")

best_model_1 = compare_models(
    sort='Accuracy',  # Can also use 'AUC', 'F1', 'Recall', 'Precision'
    n_select=1,  # Return top 1 model (can select top 3 with n_select=3)
    fold=5,  # 5-fold cross-validation (default is 10)
    # exclude=['knn', 'nb'],  # Optional: exclude specific models
    turbo=True  # Faster comparison (skips the slowest estimators)
)

# Display which algorithm was selected
print("\n" + "="*50)
print(f"🏆 BEST ALGORITHM SELECTED: {type(best_model_1).__name__}")
print("="*50)
print(f"\nFull Model Details:\n{best_model_1}")

### Tune the Best Model (OPTIONAL)

In [None]:
# ===== OPTIONAL: HYPERPARAMETER TUNING =====

print("\n🎯 Tuning hyperparameters...\n")

tuned_model_1 = tune_model(
    best_model_1,
    optimize='Accuracy',  # Metric to optimize
    n_iter=10,  # Number of sampled combinations (higher = more thorough but slower)
    fold=5  # Cross-validation folds
)

print("\n✅ Tuning complete!")

### Finalize and Save Model

In [None]:
# Finalize model (train on full dataset)
print("\n📦 Finalizing model...")

# Choose whether to use tuned or untuned model
# OPTION 1: Use tuned model
final_model_1 = finalize_model(tuned_model_1)

# OPTION 2: Use untuned model (skip tuning step)
# final_model_1 = finalize_model(best_model_1)

print("✅ Model finalized!\n")

# Save the model (save_model adds the .pkl extension itself)
print("💾 Saving model...")
save_model(final_model_1, 'churn_model_compare')
print("✅ Model saved as 'churn_model_compare.pkl'")

---

## 🎯 Approach 2: Using Specific Model (Manual)

**When to use:**
- You already know which algorithm to use
- You want faster training
- Your boss/client requested a specific algorithm
- You want more control

**Pros:**
- Faster (trains only one model)
- Full control over algorithm choice
- Better for production when you know what works

**Cons:**
- You might miss a better algorithm
- Requires domain knowledge

In [None]:
# ===== APPROACH 2: USING SPECIFIC MODEL =====

print("\n🎯 Training specific model (Random Forest)...\n")

# Setup PyCaret. Re-running setup() starts a new experiment, so it is only
# needed in a fresh session or when the preprocessing options change.
clf2 = setup(
    data=data_clean,
    target='Churn',
    session_id=42,
    normalize=True,
    fix_imbalance=True,
    verbose=False  # `silent` was removed in PyCaret 3.x
)

# Create specific model
# Options: 'lr', 'knn', 'nb', 'dt', 'rf', 'xgboost', 'lightgbm', 'catboost', etc.

# Example 1: Random Forest
rf_model = create_model('rf', fold=5)

# Example 2: XGBoost (uncomment to use)
# xgb_model = create_model('xgboost', fold=5)

# Example 3: Logistic Regression (uncomment to use)
# lr_model = create_model('lr', fold=5)

print(f"\n✅ Model created: {type(rf_model).__name__}")

In [None]:
# OPTIONAL: Tune the model
print("\n🎯 Tuning Random Forest...\n")

tuned_rf = tune_model(
    rf_model,
    optimize='AUC',  # AUC is more informative than accuracy on imbalanced data
    n_iter=10
)

print("✅ Tuning complete!")

In [None]:
# Finalize and save
print("\n📦 Finalizing Random Forest model...")

# OPTION 1: Use tuned model
final_rf = finalize_model(tuned_rf)

# OPTION 2: Use untuned model (uncomment to skip tuning)
# final_rf = finalize_model(rf_model)

print("✅ Model finalized!\n")

# Save the model (save_model adds the .pkl extension itself)
print("💾 Saving model...")
save_model(final_rf, 'churn_model_rf_specific')
print("✅ Model saved as 'churn_model_rf_specific.pkl'")

---

# 7️⃣ Model Evaluation and Interpretation

In [None]:
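# Evaluate model performance
print("📊 Evaluating model performance...\n")

# Evaluate tuned_rf rather than final_rf: the finalized model has already
# been trained on the hold-out rows, so its hold-out scores are optimistic
# Plot confusion matrix
plot_model(tuned_rf, plot='confusion_matrix')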

In [None]:
# Plot ROC curve
plot_model(tuned_rf, plot='auc')  # pre-finalize model, so the hold-out set is unseen

In [None]:
# Feature importance
plot_model(tuned_rf, plot='feature')  # same model as in the plots above

In [None]:
# Get predictions on test set (use the pre-finalize model so the
# hold-out rows have not been seen during training)
predictions = predict_model(tuned_rf)
print("\n🔮 Sample Predictions:")
print(predictions[['Churn', 'prediction_label', 'prediction_score']].head(10))

---

# 8️⃣ Making Predictions on New Data

In [None]:
# Create new customer data for prediction
new_customer = pd.DataFrame({
    'gender': ['Male'],
    'SeniorCitizen': [0],
    'Partner': ['Yes'],
    'Dependents': ['No'],
    'tenure': [24],
    'PhoneService': ['Yes'],
    'MultipleLines': ['Yes'],
    'InternetService': ['Fiber optic'],
    'OnlineSecurity': ['No'],
    'Contract': ['Month-to-month'],
    'PaperlessBilling': ['Yes'],
    'PaymentMethod': ['Electronic check'],
    'MonthlyCharges': [85.50],
    'TotalCharges': [2052.00]
})

print("🆕 New Customer Data:")
print(new_customer)

# Make prediction
prediction = predict_model(final_rf, data=new_customer)

print("\n🔮 Prediction Results:")
print(f"Will Churn: {prediction['prediction_label'].iloc[0]}")
# prediction_score is the probability of the predicted label
print(f"Confidence: {prediction['prediction_score'].iloc[0]:.2%}")

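In PyCaret 3.x, `prediction_score` is the probability of whichever label was predicted, so a "No" with a score of 0.90 means a 10% churn probability, not 90%. The sketch below converts the pair into a churn probability for a binary target; `churn_probability` is a hypothetical helper, and the label value `"Yes"` is taken from the dataset used in this guide.

```python
def churn_probability(label, score, positive_label="Yes"):
    """Convert (predicted label, score of that label) into P(churn)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be a probability between 0 and 1")
    # The score belongs to the predicted label; flip it for the other class
    return score if label == positive_label else 1.0 - score


print(f"Predicted Yes, score 0.80 -> churn probability {churn_probability('Yes', 0.80):.0%}")
print(f"Predicted No,  score 0.90 -> churn probability {churn_probability('No', 0.90):.0%}")
```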
---

# 9️⃣ Streamlit Deployment

## 📝 Complete Streamlit App Code

Save this code as `app.py` in the same folder as your model.

In [None]:
# This cell contains the Streamlit app code
# Copy this entire code block and save it as 'app.py'

streamlit_app_code = '''
import streamlit as st
import pandas as pd
from pycaret.classification import load_model, predict_model

# ===== PAGE CONFIGURATION =====
st.set_page_config(
    page_title="Customer Churn Prediction",
    page_icon="📊",
    layout="wide"
)

# ===== LOAD MODEL =====
@st.cache_resource
def load_churn_model():
    return load_model('churn_model_rf_specific')

model = load_churn_model()

# ===== TITLE AND DESCRIPTION =====
st.title("📊 Customer Churn Prediction App")
st.markdown("""
This app predicts whether a customer will churn (leave) based on their profile.
Fill in the customer details below and click **Predict** to see the results.
""")

# ===== SIDEBAR FOR INPUTS =====
st.sidebar.header("Customer Information")

# Demographics
st.sidebar.subheader("Demographics")
gender = st.sidebar.selectbox("Gender", ["Male", "Female"])
senior_citizen = st.sidebar.selectbox("Senior Citizen", ["No", "Yes"])
partner = st.sidebar.selectbox("Partner", ["Yes", "No"])
dependents = st.sidebar.selectbox("Dependents", ["Yes", "No"])

# Account Information
st.sidebar.subheader("Account Information")
tenure = st.sidebar.slider("Tenure (months)", 1, 72, 24)
contract = st.sidebar.selectbox("Contract Type", ["Month-to-month", "One year", "Two year"])
paperless_billing = st.sidebar.selectbox("Paperless Billing", ["Yes", "No"])
payment_method = st.sidebar.selectbox(
    "Payment Method", 
    ["Electronic check", "Mailed check", "Bank transfer", "Credit card"]
)

# Services
st.sidebar.subheader("Services")
phone_service = st.sidebar.selectbox("Phone Service", ["Yes", "No"])
multiple_lines = st.sidebar.selectbox(
    "Multiple Lines", 
    ["Yes", "No", "No phone service"]
)
internet_service = st.sidebar.selectbox(
    "Internet Service", 
    ["DSL", "Fiber optic", "No"]
)
online_security = st.sidebar.selectbox(
    "Online Security", 
    ["Yes", "No", "No internet service"]
)

# Charges
st.sidebar.subheader("Billing")
monthly_charges = st.sidebar.number_input(
    "Monthly Charges (₦)",
    min_value=0.0, 
    max_value=200.0, 
    value=70.0,
    step=5.0
)
total_charges = st.sidebar.number_input(
    "Total Charges (₦)",
    min_value=0.0, 
    max_value=10000.0, 
    value=1680.0,
    step=100.0
)

# ===== CREATE PREDICTION BUTTON =====
predict_button = st.sidebar.button("🔮 Predict Churn", type="primary")

# ===== MAIN AREA =====
col1, col2 = st.columns(2)

with col1:
    st.subheader("📋 Customer Profile")
    
    # Create input dataframe
    input_data = pd.DataFrame({
        'gender': [gender],
        'SeniorCitizen': [1 if senior_citizen == "Yes" else 0],
        'Partner': [partner],
        'Dependents': [dependents],
        'tenure': [tenure],
        'PhoneService': [phone_service],
        'MultipleLines': [multiple_lines],
        'InternetService': [internet_service],
        'OnlineSecurity': [online_security],
        'Contract': [contract],
        'PaperlessBilling': [paperless_billing],
        'PaymentMethod': [payment_method],
        'MonthlyCharges': [monthly_charges],
        'TotalCharges': [total_charges]
    })
    
    st.dataframe(input_data.T, use_container_width=True)

with col2:
    st.subheader("🔮 Prediction Results")
    
    if predict_button:
        with st.spinner("Making prediction..."):
            # Make prediction
            prediction = predict_model(model, data=input_data)
            
            # Extract results. prediction_score is the probability of the
            # predicted label, so it is already the retention probability
            # when the prediction is "No".
            churn_prediction = prediction['prediction_label'].iloc[0]
            prediction_score = prediction['prediction_score'].iloc[0]
            
            # Display results
            if churn_prediction == "Yes":
                st.error("⚠️ HIGH RISK: Customer likely to churn")
                st.metric(
                    "Churn Probability", 
                    f"{prediction_score:.1%}",
                    delta="High Risk",
                    delta_color="inverse"
                )
                st.warning("""
                **Recommended Actions:**
                - Offer retention discount
                - Upgrade to longer contract
                - Improve customer service
                - Provide loyalty rewards
                """)
            else:
                st.success("✅ LOW RISK: Customer likely to stay")
                st.metric(
                    "Retention Probability", 
                    f"{prediction_score:.1%}",
                    delta="Low Risk",
                    delta_color="normal"
                )
                st.info("""
                **Recommended Actions:**
                - Continue excellent service
                - Offer upsell opportunities
                - Request feedback/reviews
                - Cross-sell additional services
                """)
    else:
        st.info("👈 Fill in customer details and click 'Predict Churn'")

# ===== FOOTER =====
st.markdown("---")
st.markdown("""
<div style='text-align: center'>
    <p>Built with ❤️ using PyCaret and Streamlit</p>
    <p>Data Science Lead: Victor Bolade | Blossom Academy</p>
</div>
""", unsafe_allow_html=True)
'''

# Save the Streamlit app code to a file
# (UTF-8 is required because the app text contains emoji and the ₦ sign)
with open('app.py', 'w', encoding='utf-8') as f:
    f.write(streamlit_app_code)

print("✅ Streamlit app code saved as 'app.py'")
print("\nTo run the app, use this command in your terminal:")
print("streamlit run app.py")

## 🚀 How to Run Streamlit App

### Step 1: Make sure you have the model saved
```python
# This was already done earlier in the notebook
save_model(final_rf, 'churn_model_rf_specific')
```

### Step 2: Save the app.py file
The code above already saved `app.py` in your current directory.

### Step 3: Run Streamlit
Open your terminal/command prompt and run:
```bash
streamlit run app.py
```

### Step 4: Access the app
Your browser will automatically open to `http://localhost:8501`

## 🌍 Deploy to Streamlit Cloud (FREE!)

1. Push your code to GitHub (`app.py`, the saved `.pkl` model file, and a `requirements.txt` listing `pycaret` and `streamlit`)
2. Go to https://share.streamlit.io
3. Connect your GitHub repository
4. Deploy with one click!
5. Get a public URL like: `https://yourapp.streamlit.app`

---

# 🔟 Summary: Complete Workflows

## 📋 Workflow 1: Full Automation (compare_models)

```python
import pandas as pd
from pycaret.classification import *

# 1. Clean data
data = pd.read_csv('data.csv')
data = data.drop(columns=['customerID'])
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# 2. Setup PyCaret
clf = setup(
    data=data,
    target='Churn',
    session_id=42,
    normalize=True,
    fix_imbalance=True,  # Handle imbalanced data
    verbose=False  # `silent` was removed in PyCaret 3.x
)

# 3. Compare models
best = compare_models()
print(f"Best Algorithm: {type(best).__name__}")

# 4. Tune (optional)
tuned = tune_model(best)

# 5. Finalize
final = finalize_model(tuned)

# 6. Save
save_model(final, 'model')
```

## 🎯 Workflow 2: Specific Model (faster)

```python
import pandas as pd
from pycaret.classification import *

# 1. Clean data (same as above)
data = pd.read_csv('data.csv')
data = data.drop(columns=['customerID'])
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')

# 2. Setup PyCaret
clf = setup(
    data=data,
    target='Churn',
    session_id=42,
    normalize=True,
    fix_imbalance=True,
    verbose=False  # `silent` was removed in PyCaret 3.x
)

# 3. Create specific model
model = create_model('rf')  # or 'xgboost', 'lightgbm', 'catboost'
print(f"Model: {type(model).__name__}")

# 4. Tune (optional)
tuned = tune_model(model)

# 5. Finalize
final = finalize_model(tuned)

# 6. Save
save_model(final, 'model')
```

## ⚡ Workflow 3: Minimal (no tuning)

```python
import pandas as pd
from pycaret.classification import *

# 1. Clean data
data = pd.read_csv('data.csv')
data = data.drop(columns=['customerID'])

# 2. Setup
clf = setup(data=data, target='Churn', session_id=42, verbose=False)

# 3. Train
model = create_model('rf')

# 4. Finalize and save
final = finalize_model(model)
save_model(final, 'model')
```

---

# 1️⃣1️⃣ Key Parameters Reference

## 📊 setup() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `data` | DataFrame | Required | Your pandas DataFrame |
| `target` | str | Required | Target column name |
| `session_id` | int | None | Random seed for reproducibility |
| `normalize` | bool | False | Scale numeric features |
| `transformation` | bool | False | Apply power transformations |
| `fix_imbalance` | bool | False | Handle class imbalance with SMOTE |
| `remove_outliers` | bool | False | Remove outliers |
| `remove_multicollinearity` | bool | False | Remove correlated features |
| `categorical_features` | list | None | List of categorical columns |
| `numeric_features` | list | None | List of numeric columns |
| `ignore_features` | list | None | Columns to ignore |
| `verbose` | bool | True | Print the setup summary (set to False to suppress it) |

## 🔍 compare_models() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `sort` | str | 'Accuracy' | Metric to sort by |
| `n_select` | int | 1 | Number of models to return |
| `fold` | int | 10 | Cross-validation folds |
| `exclude` | list | None | Models to exclude |
| `turbo` | bool | True | Faster comparison (skips the slowest estimators) |

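Conceptually, `compare_models()` cross-validates every candidate, ranks the results by the `sort` metric, and returns the top `n_select`. The sketch below reproduces that loop in plain Python with two toy "models" on made-up data. All names (`majority_class`, `tenure_rule`, `cross_val_accuracy`, `compare`) are hypothetical and this is not how PyCaret is implemented internally.

```python
from collections import Counter


def majority_class(train_rows):
    """Baseline: always predict the most common label in the training rows."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda x: most_common


def tenure_rule(train_rows):
    """Toy rule: predict churn for customers with short tenure."""
    return lambda x: "Yes" if x < 12 else "No"


def cross_val_accuracy(fit, rows, k=5):
    """Mean accuracy over k folds (fold i holds out rows i, i+k, i+2k, ...)."""
    scores = []
    for i in range(k):
        valid = rows[i::k]
        train = [row for j, row in enumerate(rows) if j % k != i]
        predict = fit(train)
        scores.append(sum(predict(x) == y for x, y in valid) / len(valid))
    return sum(scores) / k


def compare(candidates, rows, n_select=1, k=5):
    """Rank candidates by cross-validated accuracy and return the top names."""
    ranked = sorted(
        candidates,
        key=lambda name: cross_val_accuracy(candidates[name], rows, k),
        reverse=True,
    )
    return ranked[:n_select]


# Made-up data: (tenure in months, churn label); short-tenure customers churn
rows = [(t, "Yes" if t < 12 else "No") for t in range(1, 41)]
best = compare({"majority": majority_class, "tenure_rule": tenure_rule}, rows)
print(best)
```

Swapping the accuracy calculation for another metric is the toy equivalent of changing `sort`.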
## ⚙️ tune_model() Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `optimize` | str | 'Accuracy' | Metric to optimize |
| `n_iter` | int | 10 | Number of iterations |
| `fold` | int | 10 | Cross-validation folds |
| `custom_grid` | dict | None | Custom hyperparameter grid |

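In its default configuration, `tune_model()` runs a random search: it samples `n_iter` parameter combinations from a grid and keeps the best one. The sketch below shows that idea in plain Python. `random_search`, the grid values, and `toy_score` are all made up for illustration; a real run would cross-validate a model inside the scoring function.

```python
import random


def random_search(score, grid, n_iter=10, seed=42):
    """Try n_iter random parameter combinations and keep the best-scoring one."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in grid.items()}
        current = score(params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical grid, similar in shape to what custom_grid accepts
grid = {"max_depth": [2, 4, 6, 8, 10], "n_estimators": [50, 100, 200, 400]}


def toy_score(params):
    """Made-up score that peaks at max_depth=6 and n_estimators=200."""
    return -abs(params["max_depth"] - 6) - abs(params["n_estimators"] - 200) / 100


best_params, best_score = random_search(toy_score, grid, n_iter=10)
print(best_params, best_score)
```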
---

# 1️⃣2️⃣ Common Questions & Answers

## ❓ Q1: Do I HAVE to use compare_models()?
**A:** No! You can directly create a specific model with `create_model('rf')` if you already know which algorithm to use.

## ❓ Q2: Do I HAVE to tune the model?
**A:** No! Tuning is optional. You can skip it and go straight to `finalize_model(best_model)`.

## ❓ Q3: How do I know which algorithm was selected?
**A:** Use `print(type(best_model).__name__)` after `compare_models()`.

## ❓ Q4: What if my dataset is imbalanced?
**A:** Set `fix_imbalance=True` in `setup()`. PyCaret will use SMOTE by default.

## ❓ Q5: Can I use my own custom algorithm?
**A:** Yes! You can pass any scikit-learn compatible estimator to `create_model()`.

## ❓ Q6: How do I deploy to production?
**A:** Use Streamlit (easiest), Flask, or FastAPI. PyCaret models work with all frameworks.

## ❓ Q7: Can I use PyCaret with deep learning?
**A:** For deep learning, use PyTorch or TensorFlow. PyCaret focuses on classical ML and gradient boosting.

## ❓ Q8: What's the difference between finalize_model() and create_model()?
**A:** `create_model()` trains on the training set only. `finalize_model()` retrains on the entire dataset, including the hold-out set, for deployment.

## ❓ Q9: Can I use PyCaret for regression?
**A:** Yes! Use `from pycaret.regression import *` instead of classification.

## ❓ Q10: Does PyCaret work with categorical data?
**A:** Yes! PyCaret automatically handles categorical encoding.

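To give a feel for what SMOTE does when `fix_imbalance=True` (Q4), the sketch below creates synthetic minority rows by interpolating between a real minority row and its nearest minority neighbour. It is a simplification written in plain Python: real SMOTE picks at random among several nearest neighbours, and `smote_like_sample` and the sample rows are hypothetical.

```python
import random


def smote_like_sample(minority, rng):
    """Create one synthetic row between a minority row and its nearest neighbour."""
    base = rng.choice(minority)
    others = [row for row in minority if row is not base]
    neighbour = min(
        others,
        key=lambda row: sum((a - b) ** 2 for a, b in zip(base, row)),
    )
    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))


rng = random.Random(42)
# Made-up minority-class rows: (tenure, MonthlyCharges)
minority = [(2.0, 95.0), (3.0, 90.0), (5.0, 99.0), (4.0, 85.0)]
synthetic = [smote_like_sample(minority, rng) for _ in range(3)]
for row in synthetic:
    print(row)
```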
---

# 1️⃣3️⃣ Best Practices for Production

## ✅ DO

1. **Always clean data before PyCaret** - Remove IDs, handle missing values
2. **Use session_id** - For reproducible results
3. **Handle imbalance** - Set `fix_imbalance=True` if target is imbalanced
4. **Always finalize** - Call `finalize_model()` before deployment
5. **Save preprocessing** - PyCaret saves both model AND preprocessing pipeline
6. **Use cross-validation** - Default 10-fold is good, 5-fold for large datasets
7. **Version your models** - Use descriptive names: `churn_model_v2.pkl`
8. **Monitor performance** - Track metrics in production

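The advice on imbalance is easier to appreciate with numbers. With the 27% churn rate used in the demo data, a "model" that always predicts "No" still reaches 73% accuracy while catching no churners at all. The sketch below works through that arithmetic; `accuracy` and `recall` are small hypothetical helpers, not PyCaret functions.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive="Yes"):
    """Share of true positives that the predictions caught."""
    positives = [p for t, p in zip(y_true, y_pred) if t == positive]
    return sum(p == positive for p in positives) / len(positives)


y_true = ["Yes"] * 27 + ["No"] * 73   # 27% churn, as in the demo data
always_no = ["No"] * 100              # never flags anyone as a churner

print(f"Accuracy: {accuracy(y_true, always_no):.0%}")
print(f"Recall:   {recall(y_true, always_no):.0%}")
```

This is why metrics such as AUC, F1, or Recall are often better choices for `sort` and `optimize` when the target is imbalanced.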
## ❌ DON'T

1. **Don't skip data cleaning** - PyCaret is not magic, garbage in = garbage out
2. **Don't call setup() more often than needed** - Each call starts a new experiment and replaces the previous one
3. **Don't deploy without finalize** - Model won't perform optimally
4. **Don't ignore class imbalance** - Can lead to poor performance
5. **Don't over-tune** - Returns diminish as `n_iter` grows, and heavy tuning can overfit the validation folds
6. **Don't trust compare_models blindly** - Validate on hold-out test set
7. **Don't use base environment** - Always create conda environment
8. **Don't hard-code paths** - Use relative paths or environment variables

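Versioned names and configurable paths can be combined in one small helper. The sketch below builds the model path from an environment variable instead of a hard-coded folder. `model_path` and the `MODEL_DIR` variable are hypothetical names chosen for this example; the result has no extension because `save_model()` appends `.pkl` itself, so you would pass `str(path)` to it.

```python
import os
from pathlib import Path


def model_path(name, version, base_dir=None):
    """Build a versioned model path such as <base_dir>/churn_model_v2."""
    # Fall back to the MODEL_DIR environment variable, then to ./models
    root = Path(base_dir or os.environ.get("MODEL_DIR", "models"))
    return root / f"{name}_v{version}"


path = model_path("churn_model", 2, base_dir="artifacts")
print(path)
```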
---

# 1️⃣4️⃣ Next Steps

## 🎓 For Learning

1. Try different datasets from Kaggle
2. Experiment with all classification algorithms
3. Learn about model interpretability (SHAP, LIME)
4. Practice hyperparameter tuning
5. Build a portfolio project

## 🚀 For Production

1. Set up MLflow for experiment tracking
2. Create CI/CD pipeline for model deployment
3. Implement model monitoring
4. Add authentication to Streamlit app
5. Deploy on cloud (AWS, Azure, GCP)

## 📚 Resources

- **PyCaret Documentation:** https://pycaret.org
- **Streamlit Documentation:** https://docs.streamlit.io
- **PyCaret Tutorials:** https://pycaret.gitbook.io
- **Streamlit Gallery:** https://streamlit.io/gallery

---

# 🎉 Congratulations!

You've learned:

✅ What PyCaret is and why it's powerful

✅ What Streamlit is and how to deploy ML apps

✅ Complete ML workflow from data cleaning to deployment

✅ Two approaches: automated (compare_models) vs manual (create_model)

✅ Handling imbalanced datasets

✅ Optional steps (tuning, outlier removal, etc.)

✅ Production-ready best practices

---

**Created by:** Victor Bolade - Data Science Lead Instructor

**Institution:** Blossom Academy, Ghana

**Program:** AI, Data Science & Machine Learning (WFP Partnership)

**Date:** January 2026

---

## 💡 Remember

> "The best model is the one that solves your business problem, not the one with the highest accuracy."

Good luck with your machine learning journey! 🚀