In [1]:
#  Save Models in Scikit-Learn
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Your training data
X_train = np.array([[25, 50000], [45, 80000], [35, 60000], [50, 95000]])
y_train = np.array([0, 1, 0, 1])

# Train the model
model = LogisticRegression()
model.fit(X_train, y_train)

At this point, you have a trained model stored in the variable model. But it only exists in your computer's memory while your program is running. To save it to disk, you'll use joblib.

In [2]:
import joblib

# Save the trained model to a file called customer_prediction_model.pkl
joblib.dump(model, 'customer_prediction_model.pkl')

['customer_prediction_model.pkl']

That's it! You've just saved your model to a file called customer_prediction_model.pkl. The .pkl extension stands for "pickle," which is the name of Python's serialization format. This file now contains everything your model learned during training.
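To make "serialization" concrete, here is a minimal sketch using Python's built-in `pickle` module, which joblib builds on. The `learned_params` dictionary is a hypothetical stand-in for a trained model's state, not what scikit-learn actually stores.

```python
import pickle

# Hypothetical stand-in for what a model learns during training
learned_params = {"coef": [0.8, -1.2], "intercept": 0.05}

# Serialize: turn the in-memory object into bytes that could be written to disk
blob = pickle.dumps(learned_params)

# Deserialize: rebuild an equal object from those bytes
restored = pickle.loads(blob)

print(type(blob).__name__)          # bytes
print(restored == learned_params)   # True
```

Because loading a pickle can execute arbitrary code, only load model files that come from a source you trust.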

In [3]:
# Loading Your Saved Model
import joblib

# Load the saved model back from the file with joblib.load()
loaded_model = joblib.load('customer_prediction_model.pkl')

# Now you can use it to make predictions
new_customer = [[30, 55000]]  # Age 30, income $55,000
prediction = loaded_model.predict(new_customer)
print(prediction)  # Will output 0 or 1

[0]


**Saving and loading models is a common workflow in machine learning where you**:

1. Train a model once (which can be time-consuming)
2. Save it to disk so you don't have to retrain it every time
3. Load it later to make predictions on new data

The .pkl file extension stands for "pickle", Python's built-in serialization format that allows you to store Python objects (like your trained model) to disk and reload them later.

In [4]:
# A Complete Example: training, saving, loading, and predicting
from sklearn.datasets import load_iris  # Iris is a built-in dataset in scikit-learn
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import joblib

# load dataset
iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

# train model
model = DecisionTreeClassifier(random_state=42)
model.fit(x_train, y_train)

# check its accuracy
accuracy = model.score(x_test, y_test)
print(f"Model accuracy: {accuracy}")

# save the model
joblib.dump(model, 'iris_classifier.pkl')
print("Model saved successfully.")




Model accuracy: 1.0
Model saved successfully.


In [5]:
# Then, perhaps days later or in a different script entirely, you load and use the model:
import joblib
import numpy as np
# load the saved model
model = joblib.load('iris_classifier.pkl')

# make a prediction
new_flower = np.array([[5.1,3.5,1.5,0.2]])  # Example flower measurements
prediction = model.predict(new_flower)
print(f"this flower is predicted to be type: {prediction[0]}")



this flower is predicted to be type: 0


## When you save a trained scikit-learn model using joblib, what exactly gets saved in the file?
All the learned parameters and patterns from training, such as coefficients, weights, or tree structures. This is why you can immediately make predictions without retraining.

In [6]:
# Create sample customer churn data for the example
import numpy as np

# Customer features: [age, income] 
X_train_churn = np.array([
    [25, 50000], [45, 80000], [35, 60000], [50, 95000],
    [28, 55000], [42, 75000], [38, 65000], [55, 100000],
    [30, 52000], [48, 85000], [33, 58000], [52, 92000]
])

# Churn labels: 0 = stayed, 1 = churned
y_train_churn = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

print(f"Customer churn data shape: X={X_train_churn.shape}, y={y_train_churn.shape}")
print("Data ready for training!")

Customer churn data shape: X=(12, 2), y=(12,)
Data ready for training!


In [7]:
# Saving file in different versions
import joblib
from sklearn.ensemble import RandomForestClassifier

# Train the first model using the customer churn data
model = RandomForestClassifier(random_state=42)
model.fit(X_train_churn, y_train_churn)

# Save the model with version number
version = 1
joblib.dump(model, f'customer_churn_model_v{version}.joblib')
print(f"Model version {version} saved successfully.")
print(f"Model trained on {len(X_train_churn)} customer records")

Model version 1 saved successfully.
Model trained on 12 customer records


In [8]:
# When we train a new version, just increment the version number
import joblib
from sklearn.ensemble import RandomForestClassifier

# Train version 2 with different hyperparameters (more trees)
model_v2 = RandomForestClassifier(n_estimators=200, random_state=42)
model_v2.fit(X_train_churn, y_train_churn)  # Use the same data but different model settings

# Save the new version
version = 2
joblib.dump(model_v2, f'customer_churn_model_v{version}.joblib')
print(f"Model version {version} saved successfully.")
print(f"Model v{version} trained with {model_v2.n_estimators} trees (vs 100 in v1)")

Model version 2 saved successfully.
Model v2 trained with 200 trees (vs 100 in v1)


**MAJOR.MINOR.PATCH**
* The MAJOR version changes when you make a breaking change. 
* The MINOR version changes when you make improvements that don't break compatibility.
* The PATCH version changes for small fixes or adjustments. 

1. Your first production model
    * version = "1.0.0"
    * joblib.dump(model, f'fraud_detection_model_v{version}.joblib')

2.  Three months later: retrained with more data, better performance
    * version = "1.1.0"
    * joblib.dump(model, f'fraud_detection_model_v{version}.joblib')

3.  Next week: fixed a preprocessing bug
    * version = "1.1.1"
    * joblib.dump(model, f'fraud_detection_model_v{version}.joblib')

4.  Six months later: completely redesigned features, added new data sources
    * version = "2.0.0"
    * joblib.dump(model, f'fraud_detection_model_v{version}.joblib')
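The four steps above can be sketched as a small helper that computes the next version string instead of typing it by hand. The function name `bump_version` and its `part` argument are hypothetical, chosen only for this illustration.

```python
def bump_version(version: str, part: str) -> str:
    """Return the next MAJOR.MINOR.PATCH string for the given part."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"          # breaking change resets MINOR and PATCH
    if part == "minor":
        return f"{major}.{minor + 1}.0"    # compatible improvement resets PATCH
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"part must be 'major', 'minor', or 'patch', got {part!r}")

version = "1.0.0"
version = bump_version(version, "minor")   # retrained with more data
version = bump_version(version, "patch")   # fixed a preprocessing bug
print(version)                             # 1.1.1
print(bump_version(version, "major"))      # 2.0.0
```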



In [12]:
# Imports for saving a model together with its metadata
import joblib
import json
from datetime import datetime
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
import numpy as np

# Recreate the churn data from earlier in the notebook
X_train_churn = np.array([
    [25, 50000], [45, 80000], [35, 60000], [50, 95000],
    [28, 55000], [42, 75000], [38, 65000], [55, 100000],
    [30, 52000], [48, 85000], [33, 58000], [52, 92000]
])

y_train_churn = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])
# Create test data by splitting the training data
X_train_churn, X_test_churn, y_train_churn, y_test_churn = train_test_split(
    X_train_churn, y_train_churn, test_size=0.3, random_state=42
)

# Train your model
model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
model.fit(X_train_churn, y_train_churn)

# Calculate performance
accuracy = model.score(X_test_churn, y_test_churn)

# Create metadata (the dataset name and description here are illustrative placeholders)
metadata = {
    "version": "1.2.0",
    "model_type": "RandomForestClassifier",
    "training_date": datetime.now().strftime("%Y-%m-%d"),
    "dataset": "customer_data_2025Q1.csv",
    "n_samples": len(X_train_churn),
    "accuracy": round(accuracy, 4),
    "hyperparameters": {
        "n_estimators": 100,
        "max_depth": 10,
        "random_state": 42
    },
    "description": "Retrained with Q1 data, added customer tenure feature"
}
# Save the model
joblib.dump(model, 'model_v1.2.0.joblib')

# Save the metadata
with open('model_v1.2.0_metadata.json', 'w') as f:
    json.dump(metadata, f, indent=2)

print(f"Model saved with accuracy: {accuracy:.4f}")
print("Both model and metadata saved successfully!")

Model saved with accuracy: 1.0000
Both model and metadata saved successfully!
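Metadata can also record facts that let you confirm later that a model file is the one you think it is. The sketch below uses only the standard library and a placeholder file; the `file_sha256` helper and the extra field names are assumptions made for illustration, not part of joblib or scikit-learn.

```python
import hashlib
import json
import platform
import tempfile
from pathlib import Path

def file_sha256(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder bytes; in the notebook this would be the saved model_v1.2.0.joblib
    model_path = Path(tmp) / "model_v1.2.0.joblib"
    model_path.write_bytes(b"placeholder model bytes")

    extra_metadata = {
        "model_file": model_path.name,
        "sha256": file_sha256(model_path),            # detects a swapped or corrupted file
        "python_version": platform.python_version(),  # environment used at save time
    }
    print(json.dumps(extra_metadata, indent=2))
```

In a real project it is probably also worth recording `sklearn.__version__`, since saved estimators are not guaranteed to load correctly under a different scikit-learn version.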


**The progression would typically be**:
 * 1.0.0 - Original production model
 * 1.1.0 - First improvement (maybe retrained with more data)
 * 1.2.0 - Second improvement (this version - "Retrained with Q1 data, added customer tenure feature")

The metadata description, "Retrained with Q1 data, added customer tenure feature", suggests:

* They added a new feature (customer tenure)
* They retrained with fresh Q1 2025 data
* These are improvements that don't break existing code that uses the model (this assumes callers can keep passing the same inputs; if the new feature changed what callers must send, it would be a MAJOR change, like the hour_of_day feature in the 2.0.0 example later on)
* This model is better than version 1.1.0, but someone using version 1.1.0 could easily upgrade to 1.2.0 without changing their prediction code. That's why it's a MINOR version bump, not a MAJOR one.
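The MINOR-versus-MAJOR reasoning can be expressed as a small check. This is a sketch under the assumption that every version string follows MAJOR.MINOR.PATCH exactly; `parse_version` and `is_drop_in_upgrade` are hypothetical helper names.

```python
def parse_version(version):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of ints, e.g. (1, 2, 0)."""
    return tuple(int(piece) for piece in version.split("."))

def is_drop_in_upgrade(current, candidate):
    """True if candidate is newer than current and shares its MAJOR version."""
    cur, cand = parse_version(current), parse_version(candidate)
    return cand[0] == cur[0] and cand > cur

print(is_drop_in_upgrade("1.1.0", "1.2.0"))  # True
print(is_drop_in_upgrade("1.2.0", "2.0.0"))  # False
```

Comparing tuples of integers rather than raw strings keeps the ordering correct once a component reaches two digits, for example 1.10.0 versus 1.9.0.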

In [13]:
# Now when you load your model later, you can also load its metadata to understand exactly what you're working with:
# Load model and its metadata
model = joblib.load('model_v1.2.0.joblib')

with open('model_v1.2.0_metadata.json', 'r') as f:
    metadata = json.load(f)

print(f"Loaded model version {metadata['version']}")
print(f"Trained on: {metadata['training_date']}")
print(f"Accuracy: {metadata['accuracy']}")
print(f"Description: {metadata['description']}")

Loaded model version 1.2.0
Trained on: 2025-12-24
Accuracy: 1.0
Description: Retrained with Q1 data, added customer tenure feature


# A Complete Versioning Workflow Example

Let's walk through a realistic scenario where you train multiple versions of a model over time and see how versioning helps you stay organized.

**Scenario:** You're building a model to predict whether customers will click on an ad. Here's your journey over 6 months:

In [26]:
# Step 1: Create January dataset for Version 1.0.0
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import joblib
import json
from datetime import datetime

# Create realistic January ad click data
np.random.seed(42)
n_samples_jan = 5000

# Generate January data with basic features
df_jan = pd.DataFrame({
    'age': np.random.randint(18, 65, n_samples_jan),
    'location': np.random.choice(['urban', 'suburban', 'rural'], n_samples_jan),
    'device_type': np.random.choice(['mobile', 'desktop', 'tablet'], n_samples_jan),
    # Synthetic labels: about 22% clicks, drawn at random and independent of the features
    'clicked': np.random.choice([0, 1], n_samples_jan, p=[0.78, 0.22])
})

# Prepare data for training
le_location = LabelEncoder()
le_device = LabelEncoder()

X_jan = df_jan[['age', 'location', 'device_type']].copy()
X_jan['location'] = le_location.fit_transform(X_jan['location'])
X_jan['device_type'] = le_device.fit_transform(X_jan['device_type'])
y_jan = df_jan['clicked']

# Split the data
X_train_jan, X_test_jan, y_train_jan, y_test_jan = train_test_split(
    X_jan, y_jan, test_size=0.2, random_state=42
)

print(f"January dataset: {len(X_train_jan)} training samples, {len(X_test_jan)} test samples")
print(f"Features: {list(X_jan.columns)}")
print(f"Click rate: {y_jan.mean():.2%}")

January dataset: 4000 training samples, 1000 test samples
Features: ['age', 'location', 'device_type']
Click rate: 22.56%


In [27]:
# Version 1.0.0 - Initial model in January
print("🚀 Training Version 1.0.0 - Initial Model")
print("="*50)

# Train initial model
model_v1_0_0 = LogisticRegression(random_state=42)
model_v1_0_0.fit(X_train_jan, y_train_jan)
accuracy_v1_0_0 = model_v1_0_0.score(X_test_jan, y_test_jan)

print(f"Model accuracy: {accuracy_v1_0_0:.2%}")

# Save model
version = "1.0.0"
joblib.dump(model_v1_0_0, f'ad_click_model_v{version}.joblib')

# Save metadata
metadata_v1_0_0 = {
    "version": version,
    "date": "2025-01-15",
    "dataset": "ad_clicks_jan.csv",
    "samples": len(X_train_jan),
    "accuracy": round(accuracy_v1_0_0, 4),
    "model_type": "LogisticRegression",
    "features": ["age", "location", "device_type"],
    "description": "Initial model with basic features: age, location, device type"
}

with open(f'ad_click_model_v{version}_metadata.json', 'w') as f:
    json.dump(metadata_v1_0_0, f, indent=2)

print(f"✅ Version {version} saved!")
print(f"📊 Accuracy: {accuracy_v1_0_0:.2%}")
print(f"💾 Files: ad_click_model_v{version}.joblib & ad_click_model_v{version}_metadata.json")
print("\n🎯 Result: Deployed to production. Works reasonably well!\n")

🚀 Training Version 1.0.0 - Initial Model
Model accuracy: 77.40%
✅ Version 1.0.0 saved!
📊 Accuracy: 77.40%
💾 Files: ad_click_model_v1.0.0.joblib & ad_click_model_v1.0.0_metadata.json

🎯 Result: Deployed to production. Works reasonably well!



In [28]:
# Step 2: Create Q1 dataset for Version 1.1.0 (3 months of data)
print("📈 Creating Q1 Dataset - March Update")
print("="*50)

# Create Q1 data (3x more samples - January + February + March)
np.random.seed(123)  # Different seed for more varied data
n_samples_q1 = 15000

df_q1 = pd.DataFrame({
    'age': np.random.randint(18, 65, n_samples_q1),
    'location': np.random.choice(['urban', 'suburban', 'rural'], n_samples_q1),
    'device_type': np.random.choice(['mobile', 'desktop', 'tablet'], n_samples_q1),
    'clicked': np.random.choice([0, 1], n_samples_q1, p=[0.75, 0.25])  # Synthetic labels: about 25% clicks, drawn at random
})

# Prepare Q1 data
X_q1 = df_q1[['age', 'location', 'device_type']].copy()
X_q1['location'] = le_location.transform(X_q1['location'])
X_q1['device_type'] = le_device.transform(X_q1['device_type'])
y_q1 = df_q1['clicked']

# Split the data
X_train_q1, X_test_q1, y_train_q1, y_test_q1 = train_test_split(
    X_q1, y_q1, test_size=0.2, random_state=42
)

print(f"Q1 dataset: {len(X_train_q1)} training samples, {len(X_test_q1)} test samples")
print(f"Features: {list(X_q1.columns)} (same as v1.0.0)")
print(f"Click rate: {y_q1.mean():.2%}")
print("📊 More data = better patterns to learn from!")

📈 Creating Q1 Dataset - March Update
Q1 dataset: 12000 training samples, 3000 test samples
Features: ['age', 'location', 'device_type'] (same as v1.0.0)
Click rate: 25.23%
📊 More data = better patterns to learn from!


In [34]:
# Version 1.1.0 - Improvement in March (more data, same features)
print("🔄 Training Version 1.1.0 - More Data!")
print("="*50)

# Train with more data (same features)
model_v1_1_0 = LogisticRegression(random_state=42)
model_v1_1_0.fit(X_train_q1, y_train_q1)
accuracy_v1_1_0 = model_v1_1_0.score(X_test_q1, y_test_q1)

print(f"Model accuracy: {accuracy_v1_1_0:.2%}")

# Save new version
version = "1.1.0"
joblib.dump(model_v1_1_0, f'ad_click_model_v{version}.joblib')

metadata_v1_1_0 = {
    "version": version,
    "date": "2025-03-20",
    "dataset": "ad_clicks_q1.csv",
    "samples": len(X_train_q1),
    "accuracy": round(accuracy_v1_1_0, 4),
    "model_type": "LogisticRegression",
    "features": ["age", "location", "device_type"],
    # Record both numbers without assuming the new one is higher
    "description": f"Retrained with Q1 data, test accuracy went from {accuracy_v1_0_0:.2%} to {accuracy_v1_1_0:.2%}"
}

with open(f'ad_click_model_v{version}_metadata.json', 'w') as f:
    json.dump(metadata_v1_1_0, f, indent=2)

print(f"✅ Version {version} saved!")
print(f"📊 Accuracy: {accuracy_v1_1_0:.2%} (vs {accuracy_v1_0_0:.2%} in v1.0.0)")
print(f"💾 Files: ad_click_model_v{version}.joblib & ad_click_model_v{version}_metadata.json")
print(f"🎯 Result: Compare with v1.0.0 before deploying v{version}, and keep v1.0.0 as backup.\n")

🔄 Training Version 1.1.0 - More Data!
Model accuracy: 74.83%
✅ Version 1.1.0 saved!
📊 Accuracy: 74.83% (vs 77.40% in v1.0.0)
💾 Files: ad_click_model_v1.1.0.joblib & ad_click_model_v1.1.0_metadata.json
🎯 Result: Compare with v1.0.0 before deploying v1.1.0, and keep v1.0.0 as backup.



In [29]:
# Step 3: Create H1 dataset for Version 2.0.0 (with NEW features!)
print("💡 Creating H1 Dataset - June Major Update")
print("="*50)

# Create H1 data with NEW FEATURE: hour_of_day (BREAKING CHANGE!)
np.random.seed(789)
n_samples_h1 = 25000

df_h1 = pd.DataFrame({
    'age': np.random.randint(18, 65, n_samples_h1),
    'location': np.random.choice(['urban', 'suburban', 'rural'], n_samples_h1),
    'device_type': np.random.choice(['mobile', 'desktop', 'tablet'], n_samples_h1),
    'hour_of_day': np.random.randint(0, 24, n_samples_h1),  # NEW FEATURE!
    # Synthetic labels: about 32% clicks, drawn at random (not tied to hour_of_day)
    'clicked': np.random.choice([0, 1], n_samples_h1, p=[0.68, 0.32])
})

# Prepare H1 data with ALL features (including the new one)
X_h1 = df_h1[['age', 'location', 'device_type', 'hour_of_day']].copy()
X_h1['location'] = le_location.transform(X_h1['location'])
X_h1['device_type'] = le_device.transform(X_h1['device_type'])
y_h1 = df_h1['clicked']

# Split the data
X_train_h1, X_test_h1, y_train_h1, y_test_h1 = train_test_split(
    X_h1, y_h1, test_size=0.2, random_state=42
)

print(f"H1 dataset: {len(X_train_h1)} training samples, {len(X_test_h1)} test samples")
print(f"Features: {list(X_h1.columns)} ← NEW: hour_of_day!")
print(f"Click rate: {y_h1.mean():.2%}")
print("⚠️  BREAKING CHANGE: Old code expects 3 features, now we have 4!")

💡 Creating H1 Dataset - June Major Update
H1 dataset: 20000 training samples, 5000 test samples
Features: ['age', 'location', 'device_type', 'hour_of_day'] ← NEW: hour_of_day!
Click rate: 32.70%
⚠️  BREAKING CHANGE: Old code expects 3 features, now we have 4!
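One way to catch this kind of breaking change before it reaches the model is to compare incoming data against the feature list stored in the metadata. The helper below is a hypothetical sketch; `check_features` is not a scikit-learn or joblib function, and it assumes each request arrives as a dictionary keyed by feature name.

```python
def check_features(metadata, row):
    """Raise ValueError unless row has exactly the features the model expects."""
    expected = metadata["features"]
    missing = [name for name in expected if name not in row]
    extra = [name for name in row if name not in expected]
    if missing or extra:
        raise ValueError(
            f"Model v{metadata['version']} expects {expected}; "
            f"missing={missing}, unexpected={extra}"
        )
    # Return the values in the order the model was trained on
    return [row[name] for name in expected]

metadata_v2 = {
    "version": "2.0.0",
    "features": ["age", "location", "device_type", "hour_of_day"],
}

# A request built for the 3-feature models fails loudly instead of silently
old_style_row = {"age": 30, "location": 2, "device_type": 1}
try:
    check_features(metadata_v2, old_style_row)
except ValueError as err:
    print(err)

new_style_row = {**old_style_row, "hour_of_day": 14}
print(check_features(metadata_v2, new_style_row))  # [30, 2, 1, 14]
```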


In [33]:
# Retrain Version 1.1.0 (same steps as the earlier v1.1.0 cell)
print("🔄 Training Version 1.1.0 - More Data!")
print("="*50)

# Train with more data (same features as 1.0.0)
model_v1_1_0 = LogisticRegression(random_state=42)
model_v1_1_0.fit(X_train_q1, y_train_q1)
accuracy_v1_1_0 = model_v1_1_0.score(X_test_q1, y_test_q1)

print(f"Model accuracy: {accuracy_v1_1_0:.2%}")

# Save version 1.1.0
version_v1_1 = "1.1.0"
joblib.dump(model_v1_1_0, f'ad_click_model_v{version_v1_1}.joblib')

metadata_v1_1_0 = {
    "version": version_v1_1,
    "date": "2025-03-20",
    "dataset": "ad_clicks_q1.csv",
    "samples": len(X_train_q1),
    "accuracy": round(accuracy_v1_1_0, 4),
    "model_type": "LogisticRegression",
    "features": ["age", "location", "device_type"],
    # Record both numbers without assuming the new one is higher
    "description": f"Retrained with Q1 data, test accuracy went from {accuracy_v1_0_0:.2%} to {accuracy_v1_1_0:.2%}"
}

with open(f'ad_click_model_v{version_v1_1}_metadata.json', 'w') as f:
    json.dump(metadata_v1_1_0, f, indent=2)

print(f"✅ Version {version_v1_1} saved!")
print(f"📊 Accuracy: {accuracy_v1_1_0:.2%} (vs {accuracy_v1_0_0:.2%} in v1.0.0)")
print()

# Now train Version 2.0.0 - Major redesign in June (NEW features + algorithm)
print("🚀 Training Version 2.0.0 - MAJOR UPDATE!")
print("="*50)

# Switch to RandomForest + new features (BREAKING CHANGES!)
model_v2_0_0 = RandomForestClassifier(n_estimators=100, random_state=42)
model_v2_0_0.fit(X_train_h1, y_train_h1)
accuracy_v2_0_0 = model_v2_0_0.score(X_test_h1, y_test_h1)

print(f"Model accuracy: {accuracy_v2_0_0:.2%}")

# Save major new version
version = "2.0.0"
joblib.dump(model_v2_0_0, f'ad_click_model_v{version}.joblib')

metadata_v2_0_0 = {
    "version": version,
    "date": "2025-06-10",
    "dataset": "ad_clicks_h1_with_time.csv",
    "samples": len(X_train_h1),
    "accuracy": round(accuracy_v2_0_0, 4),
    "model_type": "RandomForestClassifier",
    "features": ["age", "location", "device_type", "hour_of_day"],
    "breaking_changes": ["Added hour_of_day feature", "Changed from LogisticRegression to RandomForest"],
    "description": f"Major update: added time-of-day feature, switched to Random Forest. Accuracy: {accuracy_v2_0_0:.2%}"
}

with open(f'ad_click_model_v{version}_metadata.json', 'w') as f:
    json.dump(metadata_v2_0_0, f, indent=2)

print(f"✅ Version {version} saved!")
print(f"📊 Accuracy: {accuracy_v2_0_0:.2%} (vs {accuracy_v1_1_0:.2%} in v1.1.0)")
print(f"💾 Files: ad_click_model_v{version}.joblib & ad_click_model_v{version}_metadata.json")
print("⚠️  BREAKING: Requires 4 features instead of 3 + different algorithm")
print("🎯 Result: Evaluate against v1.1.0 before deploying, and update production code for the new inputs!\n")

🔄 Training Version 1.1.0 - More Data!
Model accuracy: 74.83%
✅ Version 1.1.0 saved!
📊 Accuracy: 74.83% (vs 77.40% in v1.0.0)

🚀 Training Version 2.0.0 - MAJOR UPDATE!
Model accuracy: 58.34%
✅ Version 2.0.0 saved!
📊 Accuracy: 58.34% (vs 74.83% in v1.1.0)
💾 Files: ad_click_model_v2.0.0.joblib & ad_click_model_v2.0.0_metadata.json
⚠️  BREAKING: Requires 4 features instead of 3 + different algorithm
🎯 Result: Evaluate against v1.1.0 before deploying, and update production code for the new inputs!



In [35]:
# Final Summary - Your Model Evolution Journey
print("📊 MODEL EVOLUTION SUMMARY")
print("="*60)
print(f"🥉 Version 1.0.0: {accuracy_v1_0_0:.2%} accuracy (Jan - Basic model)")
print(f"🥈 Version 1.1.0: {accuracy_v1_1_0:.2%} accuracy (Mar - More data)")
print(f"🥇 Version 2.0.0: {accuracy_v2_0_0:.2%} accuracy (Jun - New features)")
print("="*60)

print("\n📁 Your organized model directory:")
print("├── ad_click_model_v1.0.0.joblib")
print("├── ad_click_model_v1.0.0_metadata.json")
print("├── ad_click_model_v1.1.0.joblib")
print("├── ad_click_model_v1.1.0_metadata.json")
print("├── ad_click_model_v2.0.0.joblib")
print("└── ad_click_model_v2.0.0_metadata.json")

print("\n🎯 BENEFITS OF VERSIONING:")
print("✅ Can rollback to v1.1.0 if v2.0.0 has issues")
print("✅ Know exactly what changed between versions")
print("✅ Can reproduce any previous model")
print("✅ Track accuracy changes over time")
print("✅ Understand breaking changes before deployment")

print("\n🚀 NEXT STEPS FOR PRODUCTION:")
print(f"- Candidate: v2.0.0 ({accuracy_v2_0_0:.2%}), deploy only if it beats v1.1.0 on a shared test set")
print(f"- Keep v1.1.0 as backup ({accuracy_v1_1_0:.2%})")
print(f"- Archive v1.0.0 for historical reference ({accuracy_v1_0_0:.2%})")

📊 MODEL EVOLUTION SUMMARY
🥉 Version 1.0.0: 77.40% accuracy (Jan - Basic model)
🥈 Version 1.1.0: 74.83% accuracy (Mar - More data)
🥇 Version 2.0.0: 58.34% accuracy (Jun - New features)

📁 Your organized model directory:
├── ad_click_model_v1.0.0.joblib
├── ad_click_model_v1.0.0_metadata.json
├── ad_click_model_v1.1.0.joblib
├── ad_click_model_v1.1.0_metadata.json
├── ad_click_model_v2.0.0.joblib
└── ad_click_model_v2.0.0_metadata.json

🎯 BENEFITS OF VERSIONING:
✅ Can rollback to v1.1.0 if v2.0.0 has issues
✅ Know exactly what changed between versions
✅ Can reproduce any previous model
✅ Track accuracy changes over time
✅ Understand breaking changes before deployment

🚀 NEXT STEPS FOR PRODUCTION:
- Candidate: v2.0.0 (58.34%), deploy only if it beats v1.1.0 on a shared test set
- Keep v1.1.0 as backup (74.83%)
- Archive v1.0.0 for historical reference (77.40%)
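A caveat about the accuracy figures in this walkthrough: the `clicked` labels are drawn at random, independent of the features, and each version is scored on a different test set. Under those conditions accuracy mostly tracks the share of the majority class rather than model quality. A minimal sketch of that baseline follows; the `majority_baseline` helper and the toy labels are illustrative assumptions, while the click rates in the loop are the dataset-level rates printed earlier.

```python
from collections import Counter

def majority_baseline(labels):
    """Accuracy obtained by always predicting the most common label."""
    counts = Counter(labels)
    return counts.most_common(1)[0][1] / len(labels)

# Toy labels with a 25% positive rate (not the notebook's data)
toy_labels = [1] * 25 + [0] * 75
print(f"{majority_baseline(toy_labels):.2%}")  # 75.00%

# Rough baselines implied by the click rates reported for each dataset
for name, click_rate in [("v1.0.0", 0.2256), ("v1.1.0", 0.2523), ("v2.0.0", 0.3270)]:
    print(f"{name}: always predicting 'no click' scores about {1 - click_rate:.2%}")
```

Read this way, the first two models sit close to their baselines, and v2.0.0's 58.34% falls well below the roughly 67% that a constant prediction would reach. Comparing versions on one shared held-out set, alongside this baseline, gives a fairer picture than the raw numbers.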


**Key Principles**

As you start implementing versioning in your own projects, keep these principles in mind:

* Always version from the start. Don't wait until you have multiple models to start versioning. Even your first model should be version 1.0.0. It's much easier to establish good habits early than to clean up a mess later.

* Never overwrite model files. When you train a new version, save it with a new filename. Your old versions should remain untouched. Storage is cheap; losing track of which model was which is expensive.

* Write descriptions when you save metadata. Future you will thank present you for writing a brief note about why you trained this version. "Retrained with Q2 data, added customer age feature" is far more helpful than nothing.

* Keep a simple log. Consider keeping a text file or spreadsheet that lists all your model versions, when they were trained, and what changed. This gives you a quick reference without having to open every metadata file.

* Test before deploying new versions. Just because a model has higher accuracy doesn't mean it's better in production. Sometimes models with slightly lower accuracy are more robust to edge cases. Always test thoroughly before replacing a working model.
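The "keep a simple log" principle can be partly automated by building the log from the metadata files themselves. This sketch assumes the `*_metadata.json` naming used in this notebook and the `version`, `date`, `accuracy`, and `description` keys; `build_version_log` is a hypothetical helper, and the two metadata files are written to a temporary directory purely for demonstration.

```python
import csv
import io
import json
import tempfile
from pathlib import Path

def build_version_log(model_dir):
    """Collect one summary row per *_metadata.json file, sorted by version."""
    rows = []
    for path in Path(model_dir).glob("*_metadata.json"):
        meta = json.loads(path.read_text())
        rows.append({
            "version": meta["version"],
            "date": meta.get("date", ""),
            "accuracy": meta.get("accuracy", ""),
            "description": meta.get("description", ""),
        })
    # Sort numerically so that 1.10.0 comes after 1.9.0
    rows.sort(key=lambda r: tuple(int(p) for p in r["version"].split(".")))
    return rows

with tempfile.TemporaryDirectory() as tmp:
    # Two small metadata files standing in for the ones saved earlier
    for meta in [
        {"version": "1.1.0", "date": "2025-03-20", "accuracy": 0.7483,
         "description": "Retrained with Q1 data"},
        {"version": "1.0.0", "date": "2025-01-15", "accuracy": 0.774,
         "description": "Initial model"},
    ]:
        out = Path(tmp) / f"ad_click_model_v{meta['version']}_metadata.json"
        out.write_text(json.dumps(meta))

    log_rows = build_version_log(tmp)

# Render the log as CSV text; writing it to a file works the same way
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["version", "date", "accuracy", "description"])
writer.writeheader()
writer.writerows(log_rows)
print(buffer.getvalue())
```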

**Note**
Overwriting the file makes it impossible to quickly roll back to the previous working version if the new model has issues.

For example, if you overwrite a model that was working fine at 85% accuracy, you can't quickly switch back: you'd have to retrain from scratch while customers experience problems with the buggy model.
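A small guard can make accidental overwrites impossible. The sketch below uses the standard library's `pickle` as a stand-in for joblib so it runs anywhere; `save_new_version` is a hypothetical helper, and a simple existence check before calling `joblib.dump` would serve the same purpose.

```python
import pickle
import tempfile
from pathlib import Path

def save_new_version(obj, path):
    """Write obj to path, refusing to overwrite an existing file."""
    # Mode "xb" creates the file and raises FileExistsError if it already exists
    with open(path, "xb") as f:
        pickle.dump(obj, f)

with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "churn_model_v1.0.0.pkl"
    save_new_version({"weights": [0.1, 0.2]}, target)      # first save succeeds
    try:
        save_new_version({"weights": [9.9, 9.9]}, target)  # same filename again
        overwritten = True
    except FileExistsError:
        overwritten = False
        print(f"Refused to overwrite {target.name}")

    # The original contents are still intact
    with open(target, "rb") as f:
        kept = pickle.load(f)
```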

**Building Your First Machine Learning Service**

The next step is actually making your model available for others to use. This is what we call "serving" a model, and it's the bridge between your experimental notebooks and real-world applications.