# Complete Guide to PyCaret and Streamlit for Machine Learning





---

# Part 1: Understanding PyCaret

## What is PyCaret?

PyCaret is an open-source, low-code machine learning library in Python that automates machine learning workflows. It is a Python wrapper around several machine learning libraries and frameworks such as scikit-learn, XGBoost, LightGBM, CatBoost, and many others.

## Why Use PyCaret?

Traditional machine learning workflows require extensive code for:
- Data preprocessing (scaling, encoding, missing value imputation)
- Feature engineering
- Model training and evaluation
- Hyperparameter tuning
- Model deployment

PyCaret reduces this complexity by providing a high-level API that handles these tasks automatically.

## Traditional Approach vs PyCaret

### Traditional scikit-learn approach (dozens of lines, often 100+ in a real project):

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
import pandas as pd

# Load data
data = pd.read_csv('data.csv')

# Separate features and target
X = data.drop('target', axis=1)
y = data['target']

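# Encode categorical variables (one encoder per column, so each mapping can be reused on new data)
encoders = {}
for col in X.select_dtypes(include='object').columns:
    encoders[col] = LabelEncoder()
    X[col] = encoders[col].fit_transform(X[col])

# Split data (fixed seed so the split is reproducible)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)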

# Scale features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Train model
model = RandomForestClassifier()
model.fit(X_train, y_train)

# Evaluate
predictions = model.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, predictions)}")
```

### PyCaret approach (5 lines of code):

```python
import pandas as pd
from pycaret.classification import *

# Load data
data = pd.read_csv('data.csv')

# Setup handles all preprocessing automatically
clf = setup(data=data, target='target', session_id=42)

# Compare all models automatically
best_model = compare_models()
```

## PyCaret Modules

PyCaret provides modules for different machine learning tasks:

- **pycaret.classification** - Binary and multiclass classification
- **pycaret.regression** - Regression tasks
- **pycaret.clustering** - Unsupervised clustering
- **pycaret.anomaly** - Anomaly detection
- **pycaret.nlp** - Natural language processing (PyCaret 2.x only; removed in 3.0)
- **pycaret.time_series** - Time series forecasting

This notebook focuses on **classification**.

---

# Part 2: PyCaret Functions and Parameters

We will examine each function in detail, including all parameters and their usage.

## Function 1: setup()

### Purpose

The `setup()` function initializes the training environment and creates the transformation pipeline. This is the most important function in PyCaret and must be called before any other functions.

### What setup() Does

1. Infers data types (numeric vs categorical)
2. Splits data into train and test sets
3. Handles missing values
4. Encodes categorical variables
5. Scales/normalizes numeric features (if specified with `normalize=True`)
6. Handles class imbalance (if specified)
7. Removes outliers (if specified)
8. Performs feature engineering (if specified)

### Required Parameters

```python
clf = setup(
    data=data,        # pandas DataFrame
    target='column'   # name of target column
)
```

### All Parameters with Explanations

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **data** | DataFrame | Required | Pandas DataFrame containing features and target |
| **target** | str | Required | Name of the target column |
| **train_size** | float | 0.7 | Proportion of data for training (0.7 = 70% train, 30% test) |
| **test_data** | DataFrame | None | Separate DataFrame to use as test set |
| **preprocess** | bool | True | Whether to apply preprocessing |
| **imputation_type** | str | 'simple' | Method for handling missing values: 'simple' or 'iterative' |
| **numeric_imputation** | str | 'mean' | Strategy for numeric missing values: 'mean', 'median', 'mode', 'knn' |
| **categorical_imputation** | str | 'mode' | Strategy for categorical missing values: 'mode', 'constant' |
| **categorical_features** | list | None | List of columns to treat as categorical |
| **numeric_features** | list | None | List of columns to treat as numeric |
| **date_features** | list | None | List of columns to treat as dates |
| **text_features** | list | None | List of columns containing text |
| **ignore_features** | list | None | List of columns to ignore during modeling |
| **ordinal_features** | dict | None | Dictionary mapping ordinal features to their order |
| **high_cardinality_features** | list | None | Categorical features with many unique values (PyCaret 2.x; removed in 3.0) |
| **handle_unknown_categorical** | bool | True | Whether to handle categories that appear at prediction time but were not seen during training (PyCaret 2.x; removed in 3.0) |
| **unknown_categorical_method** | str | 'least_frequent' | Method to handle unknown categories (PyCaret 2.x; removed in 3.0) |
| **normalize** | bool | False | Whether to scale numeric features (the method is set by `normalize_method`) |
| **normalize_method** | str | 'zscore' | Normalization method: 'zscore', 'minmax', 'maxabs', 'robust' |
| **transformation** | bool | False | Whether to apply power transformations to make data more Gaussian |
| **transformation_method** | str | 'yeo-johnson' | Power transformation method: 'yeo-johnson' or 'quantile' |
| **pca** | bool | False | Whether to apply Principal Component Analysis |
| **pca_method** | str | 'linear' | PCA method: 'linear', 'kernel', 'incremental' |
| **pca_components** | int/float | None | Number of components (int) or variance to retain (float between 0-1) |
| **feature_selection** | bool | False | Whether to perform feature selection |
| **feature_selection_method** | str | 'classic' | Method: 'univariate', 'classic', 'sequential' |
| **feature_selection_estimator** | str | 'lightgbm' | Algorithm to use for feature importance |
| **n_features_to_select** | int/float | 0.2 | Number of features to select |
| **remove_outliers** | bool | False | Whether to remove outliers from training data |
| **outliers_method** | str | 'iforest' | Outlier detection method: 'iforest', 'ee', 'lof' |
| **outliers_threshold** | float | 0.05 | Percentage of outliers to remove |
| **fix_imbalance** | bool | False | Whether to fix class imbalance in target variable |
| **fix_imbalance_method** | object | None | Custom resampling object (default uses SMOTE) |
| **remove_multicollinearity** | bool | False | Whether to remove highly correlated features |
| **multicollinearity_threshold** | float | 0.9 | Correlation threshold above which to remove features |
| **polynomial_features** | bool | False | Whether to create polynomial and interaction features |
| **polynomial_degree** | int | 2 | Degree of polynomial features |
| **bin_numeric_features** | list | None | List of numeric features to bin/discretize |
| **group_features** | list | None | List of features to group (aggregate statistics) |
| **session_id** | int | None | Random seed for reproducibility |
| **log_experiment** | bool | False | Whether to log experiment to MLflow |
| **experiment_name** | str | None | Name for MLflow experiment |
| **log_plots** | bool | False | Whether to log plots to MLflow |
| **log_profile** | bool | False | Whether to log data profile |
| **log_data** | bool | False | Whether to log training and test data |
| **verbose** | bool | True | Whether to print information during setup |
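
To make the `normalize_method` options more concrete, here is a minimal standard-library sketch of what z-score and min-max scaling compute. It is an illustration only, not PyCaret's implementation: the function names `zscore` and `minmax` and the sample `tenure` values are made up for this example, and the z-score is assumed to use the population standard deviation.

```python
from statistics import fmean, pstdev


def zscore(values):
    """Center on the mean and divide by the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # assumes the values are not all identical (std > 0)
    return [(v - mean) / std for v in values]


def minmax(values):
    """Rescale so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical tenure values (months), used only for illustration
tenure = [1, 24, 72]
print("zscore:", zscore(tenure))
print("minmax:", minmax(tenure))
```

Either way the ordering of the values is preserved; only their scale changes, which mostly matters for distance-based and linear models.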

### Common Usage Examples

#### Basic setup:
```python
clf = setup(data=data, target='Churn', session_id=42)
```

#### Setup with normalization:
```python
clf = setup(
    data=data,
    target='Churn',
    session_id=42,
    normalize=True
)
```

#### Setup with imbalance handling:
```python
clf = setup(
    data=data,
    target='Churn',
    session_id=42,
    fix_imbalance=True
)
```

#### Full setup with multiple options:
```python
clf = setup(
    data=data,
    target='Churn',
    session_id=42,
    normalize=True,
    categorical_features=['PhoneService', 'Contract'],
    numeric_features=['tenure', 'MonthlyCharges'],
    fix_imbalance=True,
    remove_outliers=True,
    remove_multicollinearity=True
)
```

## Function 2: compare_models()

### Purpose

This function trains and evaluates all available classification algorithms using cross-validation and returns the best model based on a specified metric.

### Is This Function Required?

**No.** This function is OPTIONAL. You can skip it and directly create a specific model using `create_model()`.

### When to Use

- When you don't know which algorithm will perform best
- When you want to quickly evaluate multiple algorithms
- For initial exploratory analysis

### When NOT to Use

- When you already know which algorithm to use
- When time is limited (compare_models is slower)
- When you want full control over the algorithm

### Available Algorithms

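When you run `compare_models()`, PyCaret tests the algorithms below. With the default `turbo=True`, the slower estimators (`rbfsvm`, `gpc`, `mlp`) are skipped, and `xgboost` and `catboost` are included only if those libraries are installed: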

| Algorithm ID | Full Name | Description |
|--------------|-----------|-------------|
| lr | Logistic Regression | Simple linear classifier |
| knn | K-Nearest Neighbors | Instance-based learning |
| nb | Naive Bayes | Probabilistic classifier |
| dt | Decision Tree | Tree-based classifier |
| svm | Support Vector Machine (Linear) | Linear SVM |
| rbfsvm | SVM with RBF Kernel | Non-linear SVM |
| gpc | Gaussian Process Classifier | Probabilistic non-parametric |
| mlp | Multi-Layer Perceptron | Neural network |
| ridge | Ridge Classifier | Regularized linear model |
| rf | Random Forest | Ensemble of decision trees |
| qda | Quadratic Discriminant Analysis | Non-linear discriminant |
| ada | AdaBoost Classifier | Boosting algorithm |
| gbc | Gradient Boosting Classifier | Gradient boosting |
| lda | Linear Discriminant Analysis | Linear discriminant |
| et | Extra Trees Classifier | Randomized decision trees |
| xgboost | Extreme Gradient Boosting | Advanced gradient boosting |
| lightgbm | Light Gradient Boosting | Fast gradient boosting |
| catboost | CatBoost Classifier | Handles categorical data well |

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **include** | list | None | List of algorithm IDs to include |
| **exclude** | list | None | List of algorithm IDs to exclude |
| **fold** | int | 10 | Number of cross-validation folds |
| **round** | int | 4 | Decimal places for metrics |
| **cross_validation** | bool | True | Whether to use cross-validation |
| **sort** | str | 'Accuracy' | Metric to sort by: 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' |
| **n_select** | int | 1 | Number of top models to return |
| **budget_time** | float | None | Maximum time in minutes for compare_models |
| **turbo** | bool | True | When True, skips estimators with long training times (`rbfsvm`, `gpc`, `mlp`) |
| **errors** | str | 'ignore' | How to handle errors: 'ignore' or 'raise' |
| **fit_kwargs** | dict | None | Custom parameters to pass to fit method |
| **groups** | str | None | Column name for GroupKFold cross-validation |
| **verbose** | bool | True | Whether to print progress |
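
The `fold` parameter controls how the training data is partitioned for cross-validation. The sketch below shows the idea with plain, unshuffled k-fold splitting in pure Python. It is a simplification for illustration (PyCaret's classification module stratifies folds by class by default), and `kfold_indices` is a hypothetical helper, not a PyCaret function.

```python
def kfold_indices(n_samples, n_folds):
    """Yield (train_idx, val_idx) pairs for plain, unshuffled k-fold."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for fold in range(n_folds):
        # The first `extra` folds get one additional sample
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# Every sample is used for validation exactly once across the folds
for train_idx, val_idx in kfold_indices(n_samples=10, n_folds=5):
    print(f"train={train_idx} validate={val_idx}")
```

Each model is trained once per fold, so lowering `fold` reduces the run time of `compare_models()` roughly in proportion.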

### Usage Examples

#### Basic usage (returns best model):
```python
best_model = compare_models()
```

#### Sort by different metric:
```python
best_model = compare_models(sort='AUC')
```

#### Return top 3 models:
```python
top_3_models = compare_models(n_select=3)
```

#### Exclude specific models:
```python
best_model = compare_models(exclude=['knn', 'nb'])
```

#### Include only specific models:
```python
best_model = compare_models(include=['rf', 'xgboost', 'lightgbm'])
```

### How to Know Which Algorithm Was Selected

```python
best_model = compare_models()

# Print the algorithm name
print(f"Selected Algorithm: {type(best_model).__name__}")

# Or print the full model object
print(best_model)
```
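
When `n_select` is greater than 1, `compare_models()` returns a list of models rather than a single model, so the name lookup has to handle both cases. A small sketch follows; `model_names` is a hypothetical helper, and the two stand-in classes exist only so the example runs without PyCaret.

```python
def model_names(result):
    """Return class names for a single model or a list of models."""
    models = result if isinstance(result, list) else [result]
    return [type(m).__name__ for m in models]


# Stand-in classes used only for this illustration
class FakeForest:
    pass


class FakeBoost:
    pass


print(model_names(FakeForest()))                 # single model
print(model_names([FakeForest(), FakeBoost()]))  # e.g. the result of n_select=2
```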

## Function 3: create_model()

### Purpose

This function trains a specific classification model. Use this when you know exactly which algorithm you want to use.

### When to Use

- When you already know which algorithm to use
- When you want faster training (trains only one model)
- When your requirements specify a particular algorithm
- When you want more control

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **estimator** | str or object | Required | Algorithm ID (e.g., 'rf', 'xgboost') or sklearn estimator object |
| **fold** | int | 10 | Number of cross-validation folds |
| **round** | int | 4 | Decimal places for metrics |
| **cross_validation** | bool | True | Whether to use cross-validation |
| **fit_kwargs** | dict | None | Custom parameters to pass to fit method |
| **groups** | str | None | Column name for GroupKFold |
| **verbose** | bool | True | Whether to print progress |

### Usage Examples

#### Create Random Forest:
```python
rf_model = create_model('rf')
```

#### Create XGBoost:
```python
xgb_model = create_model('xgboost')
```

#### Create Logistic Regression:
```python
lr_model = create_model('lr')
```

#### Create with fewer folds (faster):
```python
rf_model = create_model('rf', fold=5)
```

#### Create with custom sklearn estimator:
```python
from sklearn.ensemble import RandomForestClassifier

custom_rf = RandomForestClassifier(n_estimators=200, max_depth=10)
model = create_model(custom_rf)
```

## Function 4: tune_model()

### Purpose

This function performs hyperparameter tuning on a trained model to improve its performance.

### Is This Function Required?

**No.** Tuning is OPTIONAL. You can skip this step and go directly to `finalize_model()`.

### When to Use

- When you want to maximize model performance
- When you have time for optimization
- For production models where accuracy is critical

### When NOT to Use

- When time is limited
- When the baseline model is good enough
- For quick prototypes

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **estimator** | object | Required | Trained model object from create_model() or compare_models() |
| **optimize** | str | 'Accuracy' | Metric to optimize: 'Accuracy', 'AUC', 'Recall', 'Precision', 'F1', 'Kappa', 'MCC' |
| **fold** | int | 10 | Number of cross-validation folds |
| **round** | int | 4 | Decimal places for metrics |
| **n_iter** | int | 10 | Number of candidate hyperparameter settings sampled in a random search |
| **custom_grid** | dict | None | Custom hyperparameter grid |
| **search_library** | str | 'scikit-learn' | Library for hyperparameter search: 'scikit-learn', 'scikit-optimize', 'tune-sklearn', 'optuna' |
| **search_algorithm** | str | 'random' | Search algorithm: 'random' or 'grid' |
| **early_stopping** | bool/str | False | Whether to use early stopping |
| **early_stopping_max_iters** | int | 10 | Maximum iterations without improvement |
| **choose_better** | bool | True | Whether to return tuned model only if better than original |
| **fit_kwargs** | dict | None | Custom parameters for fit method |
| **groups** | str | None | Column name for GroupKFold |
| **verbose** | bool | True | Whether to print progress |

### Usage Examples

#### Basic tuning:
```python
best_model = compare_models()
tuned_model = tune_model(best_model)
```

#### Optimize for AUC:
```python
tuned_model = tune_model(best_model, optimize='AUC')
```

#### More iterations for better optimization:
```python
tuned_model = tune_model(best_model, n_iter=50)
```

#### Custom hyperparameter grid:
```python
custom_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10]
}
tuned_model = tune_model(rf_model, custom_grid=custom_params)
```

#### Using grid search instead of random:
```python
tuned_model = tune_model(best_model, search_algorithm='grid')
```
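
To see why the choice between random and grid search matters, it helps to count the candidates. The sketch below enumerates the custom grid from the example above with the standard library and then samples 10 settings from it, which is roughly what a random search with `n_iter=10` evaluates. It only illustrates the arithmetic and does not call PyCaret.

```python
import random
from itertools import product

custom_params = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
}

# Grid search evaluates every combination: 3 * 4 * 3 = 36 candidates
all_combinations = [
    dict(zip(custom_params, values))
    for values in product(*custom_params.values())
]
print(f"Grid search candidates: {len(all_combinations)}")

# Random search evaluates only a sample of them (here 10, matching n_iter=10)
rng = random.Random(42)
sampled = rng.sample(all_combinations, k=10)
print(f"Random search candidates: {len(sampled)}")
```

With 10-fold cross-validation, each candidate costs 10 model fits, so the full grid means 360 fits against 100 for the random sample.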

## Function 5: finalize_model()

### Purpose

This function trains the model on the complete dataset (training + test sets) to maximize learning before deployment.

### Is This Function Required?

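**Yes, in this guide.** PyCaret does not enforce it, but treat it as REQUIRED before deployment: always call this function before saving your model so that the saved pipeline has been trained on all available data.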

### Why It's Important

During training and tuning, models are trained only on the training set. The test set is reserved for evaluation. Before deployment, you want to train on ALL available data to get the best possible model.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **estimator** | object | Required | Trained model object |
| **fit_kwargs** | dict | None | Custom parameters for fit method |

### Usage Examples

#### Finalize after compare_models:
```python
best_model = compare_models()
final_model = finalize_model(best_model)
```

#### Finalize after tuning:
```python
best_model = compare_models()
tuned_model = tune_model(best_model)
final_model = finalize_model(tuned_model)
```

#### Finalize a specific model:
```python
rf_model = create_model('rf')
final_model = finalize_model(rf_model)
```

## Function 6: save_model()

### Purpose

This function saves the trained model and the entire preprocessing pipeline to a file for later use in deployment.

### Is This Function Required?

**Yes, for deployment.** If you want to use the model later (in Streamlit, Flask, etc.), you must save it.

### What Gets Saved

- The trained model
- All preprocessing transformations (scaling, encoding, etc.)
- Feature names and types

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **model** | object | Required | Trained model object (preferably finalized) |
| **model_name** | str | Required | Name for the saved model file (without extension) |
| **model_only** | bool | False | If True, saves only model; if False, saves model + pipeline |

### Usage Examples

#### Basic save:
```python
save_model(final_model, 'my_model')
# Creates: my_model.pkl
```

#### Save with path:
```python
save_model(final_model, 'models/churn_model')
# Creates: models/churn_model.pkl
```

#### Save only model (no preprocessing):
```python
save_model(final_model, 'my_model', model_only=True)
```
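
When the model name includes a folder, as in `models/churn_model`, that folder may need to exist before saving. The sketch below assumes `save_model()` does not create missing folders and that it appends `.pkl` to the name, as the examples above indicate. `prepare_model_path` is a hypothetical helper, not part of PyCaret.

```python
from pathlib import Path


def prepare_model_path(model_name):
    """Create the parent folder and return the .pkl path expected for model_name."""
    target = Path(f"{model_name}.pkl")
    target.parent.mkdir(parents=True, exist_ok=True)
    return target


# Usage (requires PyCaret and a finalized model):
#   path = prepare_model_path('models/churn_model')
#   save_model(final_model, 'models/churn_model')
#   print(path.exists())
```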

## Function 7: load_model()

### Purpose

This function loads a previously saved model for making predictions.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **model_name** | str | Required | Name of saved model file (without .pkl extension) |

### Usage Examples

```python
model = load_model('my_model')
# or
model = load_model('models/churn_model')
```

## Function 8: predict_model()

### Purpose

This function generates predictions on new data using a trained model.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **estimator** | object | Required | Trained model object |
| **data** | DataFrame | None | Data to predict on (if None, uses test set) |
| **round** | int | 4 | Decimal places for prediction probabilities |

### Returns

DataFrame with original data plus:
- `prediction_label` - Predicted class
- `prediction_score` - Probability of the predicted class (not necessarily of the positive class)
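
Because `prediction_score` refers to whichever class was predicted, turning it into the probability of one fixed class takes a small conversion. This is a minimal sketch for the binary case, assuming the labels are the strings "Yes" and "No"; `positive_class_probability` is a hypothetical helper.

```python
def positive_class_probability(label, score, positive_label="Yes"):
    """Convert (predicted label, score of that label) into P(positive class)."""
    # If the positive class was predicted, the score already is its probability;
    # otherwise the score belongs to the other class, so take the complement.
    return score if label == positive_label else 1.0 - score


print(positive_class_probability("Yes", 0.80))  # churn predicted with score 0.80
print(positive_class_probability("No", 0.80))   # stay predicted with score 0.80
```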

### Usage Examples

#### Predict on test set:
```python
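# Use a model that has not been finalized: a finalized model has already
# been trained on the test set, so its test-set scores are optimistic
predictions = predict_model(best_model)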
```

#### Predict on new data:
```python
new_data = pd.DataFrame({...})
predictions = predict_model(final_model, data=new_data)
```

## Function 9: plot_model()

### Purpose

This function creates visualizations for model evaluation and interpretation.

### Parameters

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| **estimator** | object | Required | Trained model object |
| **plot** | str | 'auc' | Type of plot to create |
| **save** | bool | False | Whether to save plot as image |

### Available Plots

| Plot ID | Description |
|---------|-------------|
| auc | Area Under Curve (ROC) |
| confusion_matrix | Confusion Matrix |
| threshold | Discrimination Threshold |
| pr | Precision-Recall Curve |
| error | Class Prediction Error |
| class_report | Classification Report |
| boundary | Decision Boundary |
| rfe | Recursive Feature Elimination |
| learning | Learning Curve |
| manifold | Manifold Learning |
| calibration | Calibration Curve |
| vc | Validation Curve |
| dimension | Dimension Learning |
| feature | Feature Importance |
| feature_all | Feature Importance (All) |

### Usage Examples

```python
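# Evaluation plots such as the ROC curve and confusion matrix use the hold-out
# test set, so pass a model that has not been finalized

# Plot ROC curve
plot_model(best_model, plot='auc')

# Plot confusion matrix
plot_model(best_model, plot='confusion_matrix')

# Plot feature importance
plot_model(best_model, plot='feature')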
```

---

# Part 3: Complete Workflows

Now that we understand all functions, let's see complete workflows with both required and optional steps.

## Preparation: Import Libraries and Load Data

```python
import pandas as pd
from pycaret.classification import *

print("Libraries imported successfully")
```

## Data Cleaning - REQUIRED BEFORE PYCARET

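You must clean your data before using PyCaret. This is NOT optional: `setup()` imputes, encodes, and scales, but it does not repair problems such as numeric columns stored as text or duplicated rows.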
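
The cleaning code is not included in this export, so the following is only a sketch of what such a step might look like for the churn data used here. The `customerID` column and the idea that `TotalCharges` arrives as text with blank entries are assumptions about the raw file, and `clean_churn_data` is a hypothetical helper.

```python
import pandas as pd


def clean_churn_data(raw):
    """Return a cleaned copy of the raw churn DataFrame."""
    df = raw.copy()
    # Identifier columns carry no predictive signal (assumed column name)
    df = df.drop(columns=["customerID"], errors="ignore")
    # Convert text to numbers; unparseable entries such as " " become NaN
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
    # Drop rows where the converted column or the target is missing
    df = df.dropna(subset=["TotalCharges", "Churn"])
    # Remove exact duplicate rows and rebuild the index
    return df.drop_duplicates().reset_index(drop=True)


# Tiny made-up sample, only to show the effect
raw = pd.DataFrame({
    "customerID": ["a", "b", "c"],
    "TotalCharges": ["10.5", " ", "20"],
    "Churn": ["Yes", "No", "No"],
})
print(clean_churn_data(raw))
```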

---

# Workflow 1: Using compare_models (Automated Approach)

This workflow lets PyCaret find the best algorithm automatically.

## Step 1: Setup PyCaret Environment

## Step 2: Compare All Models

## Step 3: Tune the Model (OPTIONAL)

You can skip this step if you want.

## Step 4: Finalize Model (REQUIRED)

## Step 5: Save Model (REQUIRED for Deployment)
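
The code cells for these steps are missing from this export. As a rough sketch, the five steps could be combined as below, using only the functions described in Part 2. The function name `run_automated_workflow`, the default target `'Churn'`, and the file name `'churn_model'` are assumptions for illustration.

```python
def run_automated_workflow(data, target="Churn", model_name="churn_model"):
    """setup -> compare_models -> tune_model -> finalize_model -> save_model."""
    # Imported here so the function can be defined without PyCaret installed
    from pycaret.classification import (
        compare_models, finalize_model, save_model, setup, tune_model,
    )

    # Step 1: initialize the experiment and the preprocessing pipeline
    setup(data=data, target=target, session_id=42)
    # Step 2: cross-validate the candidate algorithms and keep the best one
    best_model = compare_models()
    # Step 3 (optional): search for better hyperparameters
    tuned_model = tune_model(best_model)
    # Step 4: retrain on the training and test data together
    final_model = finalize_model(tuned_model)
    # Step 5: write <model_name>.pkl containing the pipeline and the model
    save_model(final_model, model_name)
    return final_model


# Usage, with `data` being your cleaned DataFrame:
#   final_model = run_automated_workflow(data)
```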

---

# Workflow 2: Using Specific Model (Manual Approach)

This workflow uses a specific algorithm directly, skipping compare_models.

## Step 1: Setup PyCaret Environment

Setup is always required, no matter which workflow you use.

## Step 2: Create Specific Model

Choose the algorithm you want to use.

## Step 3: Tune the Model (OPTIONAL)

Again, this is optional.

## Step 4: Finalize Model (REQUIRED)

## Step 5: Save Model (REQUIRED for Deployment)
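
The code cells for these steps are also missing from this export. A possible sketch with a Random Forest follows. The function name `run_manual_workflow` and the default target `'Churn'` are assumptions, and the file name `'models/churn_model_rf'` is chosen to match the path that the Streamlit app later in this guide loads.

```python
import os


def run_manual_workflow(data, target="Churn", estimator="rf",
                        model_name="models/churn_model_rf"):
    """setup -> create_model -> tune_model -> finalize_model -> save_model."""
    # Imported here so the function can be defined without PyCaret installed
    from pycaret.classification import (
        create_model, finalize_model, save_model, setup, tune_model,
    )

    # Step 1: setup is always required
    setup(data=data, target=target, session_id=42)
    # Step 2: train the chosen algorithm with cross-validation
    model = create_model(estimator)
    # Step 3 (optional): hyperparameter tuning
    tuned_model = tune_model(model)
    # Step 4: retrain on the training and test data together
    final_model = finalize_model(tuned_model)
    # Step 5: make sure the target folder exists, then save
    folder = os.path.dirname(model_name)
    if folder:
        os.makedirs(folder, exist_ok=True)
    save_model(final_model, model_name)
    return final_model


# Usage, with `data` being your cleaned DataFrame:
#   final_model = run_manual_workflow(data, estimator='rf')
```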

---

# Workflow 3: Minimal Approach (No Tuning, No Compare)

This is the fastest workflow - just train and deploy.
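
No code for this workflow survives in this export. A minimal sketch, assuming the same `'Churn'` target and a hypothetical file name `'churn_model_minimal'`, could look like this:

```python
def run_minimal_workflow(data, target="Churn", estimator="rf",
                         model_name="churn_model_minimal"):
    """setup -> create_model -> finalize_model -> save_model (no compare, no tuning)."""
    # Imported here so the function can be defined without PyCaret installed
    from pycaret.classification import (
        create_model, finalize_model, save_model, setup,
    )

    setup(data=data, target=target, session_id=42)
    model = create_model(estimator)        # train one algorithm
    final_model = finalize_model(model)    # retrain on all data
    save_model(final_model, model_name)    # writes <model_name>.pkl
    return final_model


# Usage, with `data` being your cleaned DataFrame:
#   final_model = run_minimal_workflow(data)
```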

---

# Part 4: Model Evaluation
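
The evaluation cells are not part of this export. To show what the metrics reported by `compare_models()` and `create_model()` measure, here is a pure-Python sketch that computes accuracy, precision, recall, and F1 from actual and predicted labels. The function name and the sample labels are made up for illustration, and in practice PyCaret computes these for you.

```python
def classification_metrics(actual, predicted, positive="Yes"):
    """Compute basic binary classification metrics from two label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"Accuracy": accuracy, "Precision": precision,
            "Recall": recall, "F1": f1}


# Made-up labels: 2 actual churners, 3 actual non-churners
actual = ["Yes", "Yes", "No", "No", "No"]
predicted = ["Yes", "No", "No", "No", "Yes"]
print(classification_metrics(actual, predicted))
```

For churn, recall (the share of actual churners that the model catches) is often more important than raw accuracy, which is one reason to pass `sort='Recall'` or `optimize='Recall'`.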

---

# Part 5: Making Predictions

```python
# Predict on new customer data
new_customer = pd.DataFrame({
    'gender': ['Male'],
    'SeniorCitizen': [0],
    'Partner': ['Yes'],
    'Dependents': ['No'],
    'tenure': [24],
    'PhoneService': ['Yes'],
    'MultipleLines': ['Yes'],
    'InternetService': ['Fiber optic'],
    'OnlineSecurity': ['No'],
    'Contract': ['Month-to-month'],
    'PaperlessBilling': ['Yes'],
    'PaymentMethod': ['Electronic check'],
    'MonthlyCharges': [85.50],
    'TotalCharges': [2052.00]
})
```
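
The cell above builds `new_customer` but stops before scoring it. A sketch of the remaining step follows. It assumes a model was saved earlier under the name `'models/churn_model_rf'` (the name the Streamlit app in this guide loads); `predict_new_customer` is a hypothetical helper.

```python
def predict_new_customer(customer_df, model_name="models/churn_model_rf"):
    """Load a saved pipeline and return (label, score) for the first row."""
    # Imported here so the function can be defined without PyCaret installed
    from pycaret.classification import load_model, predict_model

    model = load_model(model_name)                    # reads <model_name>.pkl
    result = predict_model(model, data=customer_df)   # adds the prediction columns
    first_row = result.iloc[0]
    return first_row["prediction_label"], first_row["prediction_score"]


# Usage:
#   label, score = predict_new_customer(new_customer)
#   print(f"Prediction: {label} (score {score:.2f})")
```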



---

# Part 6: Understanding Streamlit

## What is Streamlit?

Streamlit is an open-source Python framework that allows you to create interactive web applications for machine learning and data science projects without needing to know HTML, CSS, or JavaScript.

## Why Use Streamlit?

Traditional web frameworks like Flask or Django require:
- Frontend development (HTML, CSS, JavaScript)
- Backend development (routing, form handling)
- Template management
- Complex deployment

Streamlit simplifies this to pure Python code.

## Streamlit vs Traditional Web Development

### Traditional Flask Approach:

```python
# app.py (Backend)
from flask import Flask, render_template, request

app = Flask(__name__)

@app.route('/')
def home():
    return render_template('index.html')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.form.to_dict()
    # More code...
```

Plus you need separate HTML files:

```html
<!-- index.html (Frontend) -->
<!DOCTYPE html>
<html>
<head>
    <title>My App</title>
    <link rel="stylesheet" href="style.css">
</head>
<body>
    <form method="POST" action="/predict">
        <!-- Form fields -->
    </form>
</body>
</html>
```

### Streamlit Approach (Pure Python):

```python
import streamlit as st

# `model` is assumed to be a trained estimator that was loaded earlier

st.title("My ML App")
age = st.number_input("Enter age")
if st.button("Predict"):
    prediction = model.predict([[age]])
    st.write(f"Result: {prediction}")
```

## Key Streamlit Features

1. **Pure Python** - No HTML/CSS/JavaScript required
2. **Interactive Widgets** - Buttons, sliders, text inputs, file uploaders
3. **Data Display** - DataFrames, charts, images, maps
4. **Fast Development** - Build apps in minutes, not hours
5. **Easy Deployment** - Deploy to Streamlit Cloud for free
6. **Automatic Rerun** - The script reruns from top to bottom on every widget interaction, and can rerun when the source file changes

## Common Streamlit Widgets

```python
# Text input
name = st.text_input("Enter your name")

# Number input
age = st.number_input("Enter age", min_value=0, max_value=120)

# Slider
rating = st.slider("Rating", 0, 10, 5)

# Selectbox (dropdown)
option = st.selectbox("Choose option", ["A", "B", "C"])

# Checkbox
agree = st.checkbox("I agree")

# Button
if st.button("Click me"):
    st.write("Button clicked!")

# File uploader
file = st.file_uploader("Upload CSV", type=['csv'])
```

## Complete Streamlit App for Our Churn Model

The following code creates a production-ready Streamlit app. Save this as `app.py`.

# This cell contains the complete Streamlit app code
# Save this as 'app.py' in the same folder as your saved model

streamlit_code = '''
import streamlit as st
import pandas as pd
from pycaret.classification import load_model, predict_model

# Page configuration
st.set_page_config(
    page_title="Customer Churn Prediction",
    page_icon="📊",
    layout="wide"
)

# Load the saved model
@st.cache_resource
def load_churn_model():
    return load_model('models/churn_model_rf')

model = load_churn_model()

# Title
st.title("Customer Churn Prediction System")
st.markdown("""
This application predicts whether a customer will churn based on their profile.
Enter the customer information in the sidebar and click 'Predict'.
""")

# Sidebar for inputs
st.sidebar.header("Customer Information")

# Demographics
st.sidebar.subheader("Demographics")
gender = st.sidebar.selectbox("Gender", ["Male", "Female"])
senior_citizen = st.sidebar.selectbox("Senior Citizen", ["No", "Yes"])
partner = st.sidebar.selectbox("Has Partner", ["Yes", "No"])
dependents = st.sidebar.selectbox("Has Dependents", ["Yes", "No"])

# Account Information
st.sidebar.subheader("Account Information")
tenure = st.sidebar.slider("Tenure (months)", 1, 72, 24)
contract = st.sidebar.selectbox(
    "Contract Type", 
    ["Month-to-month", "One year", "Two year"]
)
paperless_billing = st.sidebar.selectbox("Paperless Billing", ["Yes", "No"])
payment_method = st.sidebar.selectbox(
    "Payment Method",
    ["Electronic check", "Mailed check", "Bank transfer", "Credit card"]
)

# Services
st.sidebar.subheader("Services")
phone_service = st.sidebar.selectbox("Phone Service", ["Yes", "No"])
multiple_lines = st.sidebar.selectbox(
    "Multiple Lines",
    ["Yes", "No", "No phone service"]
)
internet_service = st.sidebar.selectbox(
    "Internet Service",
    ["DSL", "Fiber optic", "No"]
)
online_security = st.sidebar.selectbox(
    "Online Security",
    ["Yes", "No", "No internet service"]
)

# Billing
st.sidebar.subheader("Billing")
monthly_charges = st.sidebar.number_input(
    "Monthly Charges",
    min_value=0.0,
    max_value=200.0,
    value=70.0,
    step=5.0
)
total_charges = st.sidebar.number_input(
    "Total Charges",
    min_value=0.0,
    max_value=10000.0,
    value=1680.0,
    step=100.0
)

# Predict button
predict_button = st.sidebar.button("Predict Churn", type="primary")

# Main area - split into two columns
col1, col2 = st.columns(2)

with col1:
    st.subheader("Customer Profile")
    
    # Create input dataframe
    input_data = pd.DataFrame({
        'gender': [gender],
        'SeniorCitizen': [1 if senior_citizen == "Yes" else 0],
        'Partner': [partner],
        'Dependents': [dependents],
        'tenure': [tenure],
        'PhoneService': [phone_service],
        'MultipleLines': [multiple_lines],
        'InternetService': [internet_service],
        'OnlineSecurity': [online_security],
        'Contract': [contract],
        'PaperlessBilling': [paperless_billing],
        'PaymentMethod': [payment_method],
        'MonthlyCharges': [monthly_charges],
        'TotalCharges': [total_charges]
    })
    
    # Display input as table
    st.dataframe(input_data.T, use_container_width=True)

with col2:
    st.subheader("Prediction Results")
    
    if predict_button:
        with st.spinner("Making prediction..."):
            # Make prediction
            prediction = predict_model(model, data=input_data)
            
            # Extract results
            churn_prediction = prediction['prediction_label'][0]
            churn_probability = prediction['prediction_score'][0]
            
            # Display results
            if churn_prediction == "Yes":
                st.error("HIGH RISK: Customer likely to churn")
                st.metric(
                    "Churn Probability",
                    f"{churn_probability:.1%}",
                    delta="High Risk",
                    delta_color="inverse"
                )
                st.warning("""
                **Recommended Actions:**
                - Contact customer for retention offer
                - Provide discount or upgrade incentive
                - Improve customer service engagement
                - Consider loyalty rewards
                """)
            else:
                st.success("LOW RISK: Customer likely to stay")
                st.metric(
                    "Retention Probability",
                    # prediction_score is already the probability of the predicted class ("No")
                    f"{churn_probability:.1%}",
                    delta="Low Risk",
                    delta_color="normal"
                )
                st.info("""
                **Recommended Actions:**
                - Maintain current service quality
                - Explore upsell opportunities
                - Request customer feedback
                - Cross-sell additional services
                """)
    else:
        st.info("Fill in customer details in the sidebar and click 'Predict Churn'")

# Footer
st.markdown("---")
st.markdown("""
<div style='text-align: center'>
    <p>Built with PyCaret and Streamlit</p>
    <p>Blossom Academy - Data Science Program</p>
</div>
""", unsafe_allow_html=True)
'''

# Save the Streamlit app code
with open('app.py', 'w', encoding='utf-8') as f:
    f.write(streamlit_code)

print("Streamlit app saved as 'app.py'")
print("\nTo run the app:")
print("streamlit run app.py")

## How to Run the Streamlit App

### Step 1: Ensure you have the model saved

Make sure you have run the model training code above and saved the model.

### Step 2: Save the app.py file

The cell above saves the Streamlit code as `app.py`.

### Step 3: Run Streamlit

Open your terminal or command prompt and navigate to the folder containing `app.py`, then run:

```bash
streamlit run app.py
```

### Step 4: Access the app

Your browser will automatically open to `http://localhost:8501` showing your app.

## Deploying to Streamlit Cloud (Free)

1. Create a GitHub repository
2. Upload your code:
   - app.py
   - requirements.txt (containing: pycaret, streamlit)
   - Your saved model file
3. Go to https://share.streamlit.io
4. Sign in with GitHub
5. Connect your repository
6. Deploy
7. Get a public URL like: https://your-app.streamlit.app
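
A saved pickle generally loads most reliably with the same library versions that created it, so it can help to pin versions in `requirements.txt` instead of listing bare package names. The sketch below builds pinned lines from whatever is installed in the current environment; `pinned_requirements` is a hypothetical helper, and packages that are not installed are left unpinned.

```python
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return requirements.txt content with name==version where known."""
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(name)  # not installed here, so left unpinned
    return "\n".join(lines) + "\n"


print(pinned_requirements(["pycaret", "streamlit"]))

# To write the file:
#   with open('requirements.txt', 'w', encoding='utf-8') as f:
#       f.write(pinned_requirements(["pycaret", "streamlit"]))
```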

---

# Summary

## Required Steps

1. Clean your data
2. setup()
3. Either compare_models() OR create_model()
4. finalize_model()
5. save_model()

## Optional Steps

1. tune_model() - Skip if you don't need hyperparameter optimization
2. plot_model() - Skip if you don't need visualizations
3. Parameters in setup() like fix_imbalance, remove_outliers - Use only if needed

## Workflow Comparison

| Workflow | Speed | Control | When to Use |
|----------|-------|---------|-------------|
| compare_models | Slow | Low | Exploration, don't know best algorithm |
| create_model | Fast | High | Know which algorithm to use |
| With tuning | Slow | Medium | Need best performance |
| Without tuning | Fast | Medium | Good enough performance |

## PyCaret + Streamlit Benefits

1. Rapid prototyping - Build ML apps in hours, not days
2. No frontend knowledge required - Pure Python
3. Production-ready - Deploy to cloud for free
4. Reproducible - session_id ensures consistent results
5. Automated - Handles preprocessing automatically

