# 4.0 Model Training and Evaluation

## 📑 Notebook Overview

This notebook implements the **model training and evaluation stage** of the pipeline.  
Its primary goal is to select the best-performing model for sentiment analysis and prepare it for deployment.

**Workflow Overview:**
1. Load prepared, vectorized datasets from the `3.0-data-preparation` stage.  
2. Train and compare baseline models on the validation set.  
3. Select the best model and evaluate on the held-out test set.  
4. Save the final model for reproducible inference (the fitted vectorizer is already stored in `prepared_data.pkl`).


## 1 — Load Prepared Data and Libraries

Load all dependencies and import the preprocessed data artifacts (`prepared_data.pkl`).  
This file contains:
- Train, validation, and test splits (already vectorized).  
- Target labels (`y_train`, `y_val`, `y_test`).  
- The fitted TF-IDF vectorizer (ensures consistent transformations for new data).


In [None]:
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the correct file path for the prepared data
# The '../' moves up one directory from the 'notebooks' folder
file_path = '../models/prepared_data.pkl'

# Load the prepared data from the .pkl file
prepared_data = joblib.load(file_path)

# Unpack the data into their respective variables
X_train_vec = prepared_data['X_train_vec']
X_val_vec = prepared_data['X_val_vec']
X_test_vec = prepared_data['X_test_vec']
y_train = prepared_data['y_train']
y_val = prepared_data['y_val']
y_test = prepared_data['y_test']
vectorizer = prepared_data['vectorizer']

print("Prepared data loaded successfully!")

Prepared data loaded successfully!
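
A quick sanity check on the loaded dictionary can catch a stale or mismatched artifact before any training starts. The sketch below is illustrative only: `check_prepared_data` and `EXPECTED_KEYS` are hypothetical names, not part of the notebook, and the demo uses dummy NumPy arrays in place of the real TF-IDF matrices.

```python
import numpy as np

# Keys the notebook unpacks from prepared_data.pkl
EXPECTED_KEYS = {
    "X_train_vec", "X_val_vec", "X_test_vec",
    "y_train", "y_val", "y_test", "vectorizer",
}

def check_prepared_data(prepared):
    """Return rows per split; raise ValueError if the artifact is inconsistent."""
    missing = EXPECTED_KEYS - set(prepared)
    if missing:
        raise ValueError(f"Missing keys: {sorted(missing)}")

    sizes = {}
    widths = set()
    for split in ("train", "val", "test"):
        X = prepared[f"X_{split}_vec"]
        y = prepared[f"y_{split}"]
        # Each feature matrix needs exactly one label per row
        if X.shape[0] != len(y):
            raise ValueError(f"{split}: {X.shape[0]} rows but {len(y)} labels")
        sizes[split] = X.shape[0]
        widths.add(X.shape[1])

    # All splits must come from the same fitted vectorizer (same feature count)
    if len(widths) != 1:
        raise ValueError(f"Splits disagree on feature count: {sorted(widths)}")
    return sizes

# Dummy stand-in for the real artifact (assumption: shapes chosen for illustration)
demo = {
    "X_train_vec": np.zeros((4, 3)), "y_train": [0, 1, 0, 1],
    "X_val_vec": np.zeros((2, 3)), "y_val": [0, 1],
    "X_test_vec": np.zeros((2, 3)), "y_test": [1, 0],
    "vectorizer": None,
}
print(check_prepared_data(demo))
```

In the notebook itself, the same function could be called as `check_prepared_data(prepared_data)` right after `joblib.load`.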


## 2 — Model Experimentation (Validation Set)

**Goal:** compare baseline models using the validation set.  
**Models tested:**
- Logistic Regression  
- Multinomial Naive Bayes  
- Linear Support Vector Machine (SVM)

Each model is trained on the training set, evaluated on the validation set, and compared using:  
- Accuracy  
- Classification Report (Precision, Recall, F1)  

This stage identifies the best candidate before touching the test set.
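
The comparison described above can also be collected into a dictionary so the best candidate is picked programmatically rather than by reading the printed reports. This is a minimal sketch under stated assumptions: `compare_on_validation` is a hypothetical helper, and the toy corpus and its labels are invented for illustration (they are not the notebook's dataset).

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

def compare_on_validation(models, X_train, y_train, X_val, y_val):
    """Fit each model on the training split and return its validation accuracy."""
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = accuracy_score(y_val, model.predict(X_val))
    return scores

# Toy data standing in for the vectorized splits (assumption)
train_texts = ["loved it", "great value", "really good", "works great",
               "terrible quality", "very bad", "awful experience", "do not buy"]
y_train = [1, 1, 1, 1, 0, 0, 0, 0]
val_texts = ["good and great", "bad and awful"]
y_val = [1, 0]

vectorizer = TfidfVectorizer()
X_train_vec = vectorizer.fit_transform(train_texts)
X_val_vec = vectorizer.transform(val_texts)  # transform only, never refit

models = {
    "Logistic Regression": LogisticRegression(random_state=42),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(random_state=42),
}

scores = compare_on_validation(models, X_train_vec, y_train, X_val_vec, y_val)
best_name = max(scores, key=scores.get)
for name, score in scores.items():
    print(f"{name}: {score:.4f}")
print("Best on validation:", best_name)
```

Accuracy is a reasonable single selection metric here because the classes are balanced; with skewed classes, macro-averaged F1 would likely be a safer choice.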


In [4]:
print("\n--- Model Experimentation and Evaluation ---")

# Define a dictionary of models to experiment with
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Linear SVM': LinearSVC(random_state=42)
}

# Loop through each model, train it, and evaluate its performance
for model_name, model in models.items():
    print(f"\nTraining and evaluating: {model_name}")

    # Train the model
    model.fit(X_train_vec, y_train)

    # Make predictions on the validation set
    y_val_pred = model.predict(X_val_vec)

    # Evaluate the model
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", classification_report(y_val, y_val_pred))


--- Model Experimentation and Evaluation ---

Training and evaluating: Logistic Regression
Accuracy: 0.7667
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.75      0.76        75
           1       0.76      0.79      0.77        75

    accuracy                           0.77       150
   macro avg       0.77      0.77      0.77       150
weighted avg       0.77      0.77      0.77       150


Training and evaluating: Multinomial Naive Bayes
Accuracy: 0.8067
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.85      0.82        75
           1       0.84      0.76      0.80        75

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150


Training and evaluating: Linear SVM
Accuracy: 0.7867
Classification Report:
               precision    recall  f1-score 

## 3 — Final Evaluation (Test Set)

After reviewing validation results, the best-performing model (Multinomial Naive Bayes, validation accuracy 0.8067) is re-initialized, refit on the **training set**, and evaluated once on the **held-out test set**.

**Purpose:**
- Provides an unbiased estimate of performance on unseen data.  
- Checks whether model selection overfit the validation set, since the test set played no part in choosing the model.  

**Outputs:**
- Test set accuracy  
- Classification report  
- Confusion matrix


In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("\n--- Final Evaluation on the Test Set ---")

# Re-initialize the best model (Multinomial Naive Bayes)
best_model = MultinomialNB()
best_model.fit(X_train_vec, y_train)

# Make predictions on the unseen test set
y_test_pred = best_model.predict(X_test_vec)

# Evaluate the model's performance on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Set Accuracy:", test_accuracy)
print("\nTest Set Classification Report:\n", classification_report(y_test, y_test_pred))
print("\nTest Set Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


--- Final Evaluation on the Test Set ---
Test Set Accuracy: 0.78

Test Set Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.88      0.80        75
           1       0.85      0.68      0.76        75

    accuracy                           0.78       150
   macro avg       0.79      0.78      0.78       150
weighted avg       0.79      0.78      0.78       150


Test Set Confusion Matrix:
 [[66  9]
 [24 51]]
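
The figures in the classification report can be re-derived from the confusion matrix alone. In scikit-learn's layout, rows are true labels and columns are predicted labels, so recall is computed along rows and precision along columns. A short worked example using the matrix printed above (`per_class_metrics` is a hypothetical helper name):

```python
# Confusion matrix from the test-set output; rows = true class, columns = predicted class
cm = [[66, 9],
      [24, 51]]

def per_class_metrics(cm):
    """Compute precision and recall for each class of a square confusion matrix."""
    metrics = {}
    for k in range(len(cm)):
        true_positives = cm[k][k]
        actual_total = sum(cm[k])                    # row sum: samples truly in class k
        predicted_total = sum(row[k] for row in cm)  # column sum: samples predicted as k
        metrics[k] = {
            "precision": true_positives / predicted_total,
            "recall": true_positives / actual_total,
        }
    return metrics

total = sum(sum(row) for row in cm)
accuracy = sum(cm[k][k] for k in range(len(cm))) / total

print(f"accuracy = {accuracy:.2f}")
for label, m in per_class_metrics(cm).items():
    print(f"class {label}: precision = {m['precision']:.2f}, recall = {m['recall']:.2f}")
```

The 24 class-1 reviews predicted as class 0 account for the low class-1 recall (51/75 = 0.68), while only 9 class-0 reviews were misclassified.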


## 4 — Model Persistence

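The final model is saved as a `.pkl` file in the `models/` directory. The fitted TF-IDF vectorizer is not written to a separate file in this notebook; it remains available in `prepared_data.pkl` and must be loaded alongside the model at inference time.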

**Why:**  
- Enables reproducible predictions without retraining.  
- Critical step for deployment and integration into production pipelines.


In [6]:
import joblib

# Define the file path for the saved model
model_file_path = '../models/sentiment_model.pkl'

# Save the best trained model to a file
joblib.dump(best_model, model_file_path)
print(f"\nSelected model saved to {model_file_path}")


Selected model saved to ../models/sentiment_model.pkl
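
For inference, the saved model is only usable together with the same fitted vectorizer. One possible approach, sketched here as an assumption rather than as the notebook's method, is to store both in a single file so they cannot drift apart. The file name `sentiment_bundle.pkl`, the toy corpus, and its labels are hypothetical; a temporary directory is used so nothing under `models/` is touched.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data standing in for the real splits (assumption)
texts = ["great product, loved it", "terrible, would not buy again",
         "really good value", "awful experience, very bad"]
labels = [1, 0, 1, 0]

vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

new_texts = ["good and great", "bad and awful"]

with tempfile.TemporaryDirectory() as tmp:
    bundle_path = Path(tmp) / "sentiment_bundle.pkl"  # hypothetical file name

    # Persist model and vectorizer together in one artifact
    joblib.dump({"model": model, "vectorizer": vectorizer}, bundle_path)

    # Later, e.g. in a separate process: reload and predict on raw text
    bundle = joblib.load(bundle_path)
    X_new = bundle["vectorizer"].transform(new_texts)
    reloaded_pred = bundle["model"].predict(X_new)

# The reloaded pair should reproduce the in-memory predictions exactly
original_pred = model.predict(vectorizer.transform(new_texts))
print(list(reloaded_pred) == list(original_pred))
```

Pickle-based files are generally only safe to load with the same scikit-learn version that wrote them, so recording that version next to the artifact is worth considering.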


## 📊 Summary & Next Steps

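**Findings (from the validation and test results above):**
- Multinomial Naive Bayes had the highest validation accuracy (0.8067), ahead of Linear SVM (0.7867) and Logistic Regression (0.7667).  
- Test accuracy (0.78) is close to the validation figure, but class 1 recall drops to 0.68 on the test set: 24 of 75 class 1 samples were predicted as class 0.  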

**Next Steps:**
1. Extend experiments with hyperparameter tuning (GridSearchCV, RandomizedSearchCV).  
2. Explore richer text features (n-grams, embeddings, engineered features).  
3. Wrap the final model in a simple prediction service (e.g., FastAPI or Flask).  
4. Monitor model performance on new, real-world data.
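
As a starting point for the first two next steps, the sketch below tunes the Naive Bayes smoothing parameter and the n-gram range together with `GridSearchCV`. It is an illustration under stated assumptions: the toy corpus, the grid values, and `cv=2` are placeholders, and the vectorizer sits inside a `Pipeline` so it is refit on each training fold, which differs from the notebook's pre-vectorized setup.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy labelled reviews standing in for the raw training text (assumption)
texts = ["loved it", "great value", "really good", "works great",
         "terrible quality", "very bad", "awful experience", "do not buy"]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("nb", MultinomialNB()),
])

# Step names prefix the parameter names: "<step>__<parameter>"
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],  # unigrams vs. unigrams + bigrams
    "nb__alpha": [0.1, 0.5, 1.0],            # additive smoothing strength
}

search = GridSearchCV(pipeline, param_grid, cv=2, scoring="accuracy")
search.fit(texts, labels)

print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.4f}")
```

On the real dataset a larger `cv` (for example 5) would give more stable estimates, and the held-out test set should still be used only once, after tuning is finished.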
