### Summary of Notebook 4.0: Model Training and Evaluation

This notebook serves as the **model training and evaluation stage** of the machine learning project pipeline. Its primary objective is to select the best-performing model for sentiment analysis and prepare it for deployment.

The notebook's workflow consists of the following key steps:

#### Data Loading
The notebook begins by loading the **prepared and vectorized data** from a `.pkl` file with `joblib`. This data, which includes the training, validation, and test sets along with the fitted vectorizer, was generated in the previous `3.0-data-preparation` notebook. Loading it from disk keeps the splits identical across runs and avoids re-running the entire data preparation pipeline.

#### Model Experimentation and Selection
A **comparative analysis** of three different machine learning algorithms is performed on the validation set:
* Logistic Regression
* Multinomial Naive Bayes
* Linear Support Vector Machine (SVM)

Because Scikit-learn's estimators share the same `fit`/`predict` interface, all three can be trained and scored in a single loop. **Multinomial Naive Bayes** was selected because it reached the highest validation accuracy (0.8067, versus 0.7867 for Linear SVM and 0.7667 for Logistic Regression).

#### Final Evaluation
The selected model is then evaluated once on a held-out **test set** that played no part in training or model selection. Because the test set was not used to choose the model, this single run gives an unbiased estimate of performance on new data, although an estimate from a finite test set still carries sampling uncertainty. The step checks that the model generalizes and that its validation score is not an artifact of selecting on the validation data.
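
To give a rough sense of how much a single test-set accuracy can vary, the sketch below computes a normal-approximation (Wald) 95% interval. The interval method and the helper name `accuracy_confidence_interval` are assumptions added for illustration and are not part of the notebook. The inputs (0.78 accuracy on 150 test examples) are the values reported in Section 3.

```python
import math

def accuracy_confidence_interval(accuracy, n_samples, z=1.96):
    """Normal-approximation (Wald) interval for an accuracy estimate.

    Hypothetical helper for illustration; z=1.96 corresponds to ~95% coverage.
    """
    standard_error = math.sqrt(accuracy * (1.0 - accuracy) / n_samples)
    margin = z * standard_error
    # Clip to the valid range for a proportion
    return max(0.0, accuracy - margin), min(1.0, accuracy + margin)

# Test accuracy and test-set size as reported in Section 3 of this notebook
low, high = accuracy_confidence_interval(0.78, 150)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

Under these assumptions the interval spans roughly 0.71 to 0.85, so the reported score is best read as an estimate with a margin of several percentage points.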

#### Model Persistence
The selected model is saved to the project's `models` directory as a `.pkl` file, so it can be loaded and used for new predictions without retraining. The fitted TF-IDF vectorizer has to be kept with it, because new text must be transformed in the same way before prediction; in the code below only the model is written out, while the vectorizer remains available inside `prepared_data.pkl`.

#### Section 1: Load Prepared Data and Libraries

This first cell imports the required libraries and loads the prepared data from the `.pkl` file created in the previous notebook.

In [None]:
import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Define the correct file path for the prepared data
# The '../' moves up one directory from the 'notebooks' folder
file_path = '../models/prepared_data.pkl'

# Load the prepared data from the .pkl file
prepared_data = joblib.load(file_path)

# Unpack the data into their respective variables
X_train_vec = prepared_data['X_train_vec']
X_val_vec = prepared_data['X_val_vec']
X_test_vec = prepared_data['X_test_vec']
y_train = prepared_data['y_train']
y_val = prepared_data['y_val']
y_test = prepared_data['y_test']
vectorizer = prepared_data['vectorizer']

print("Prepared data loaded successfully!")

Prepared data loaded successfully!
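
Unpacking by key fails with a bare `KeyError` on the first missing entry. As an optional safeguard, the sketch below checks all expected keys up front and reports every missing one at once. The helper `check_prepared_data` and the constant `EXPECTED_KEYS` are hypothetical names introduced here; the key names themselves are the ones used in the cell above.

```python
# Keys the loading cell above expects to find in the prepared-data dictionary
EXPECTED_KEYS = (
    "X_train_vec", "X_val_vec", "X_test_vec",
    "y_train", "y_val", "y_test", "vectorizer",
)

def check_prepared_data(data, expected_keys=EXPECTED_KEYS):
    """Raise a KeyError naming every expected key that is absent (hypothetical helper)."""
    missing = [key for key in expected_keys if key not in data]
    if missing:
        raise KeyError(f"prepared data is missing keys: {missing}")
    return data

# Stand-in dictionary for illustration; the real one comes from joblib.load
example = {key: None for key in EXPECTED_KEYS}
check_prepared_data(example)
```

In the notebook this could be called as `check_prepared_data(prepared_data)` directly after `joblib.load`.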


#### Section 2: Model Experimentation and Evaluation

This section defines a dictionary of models to test and then uses a loop to train and evaluate each one on the validation set.

In [4]:
print("\n--- Model Experimentation and Evaluation ---")

# Define a dictionary of models to experiment with
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Multinomial Naive Bayes': MultinomialNB(),
    'Linear SVM': LinearSVC(random_state=42)
}

# Loop through each model, train it, and evaluate its performance
for model_name, model in models.items():
    print(f"\nTraining and evaluating: {model_name}")

    # Train the model
    model.fit(X_train_vec, y_train)

    # Make predictions on the validation set
    y_val_pred = model.predict(X_val_vec)

    # Evaluate the model
    accuracy = accuracy_score(y_val, y_val_pred)
    print(f"Accuracy: {accuracy:.4f}")
    print("Classification Report:\n", classification_report(y_val, y_val_pred))


--- Model Experimentation and Evaluation ---

Training and evaluating: Logistic Regression
Accuracy: 0.7667
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.75      0.76        75
           1       0.76      0.79      0.77        75

    accuracy                           0.77       150
   macro avg       0.77      0.77      0.77       150
weighted avg       0.77      0.77      0.77       150


Training and evaluating: Multinomial Naive Bayes
Accuracy: 0.8067
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.85      0.82        75
           1       0.84      0.76      0.80        75

    accuracy                           0.81       150
   macro avg       0.81      0.81      0.81       150
weighted avg       0.81      0.81      0.81       150


Training and evaluating: Linear SVM
Accuracy: 0.7867
Classification Report:
               precision    recall  f1-score 

#### Section 3: Final Evaluation on the Test Set

After reviewing the results from the previous cell, we identified Multinomial Naive Bayes as the best-performing model. This cell performs the final, single evaluation on the unseen test set to get an unbiased performance score.
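
The selection described above was made by reading the printed reports. A minimal sketch of making the same choice programmatically follows, assuming the validation accuracies are collected into a dictionary (the name `val_accuracy` is hypothetical; the values are copied from the output of the previous cell).

```python
# Validation accuracies copied from the output of the previous cell
val_accuracy = {
    "Logistic Regression": 0.7667,
    "Multinomial Naive Bayes": 0.8067,
    "Linear SVM": 0.7867,
}

# Pick the model name with the highest validation accuracy
best_name = max(val_accuracy, key=val_accuracy.get)
print(best_name, val_accuracy[best_name])

# Rank all candidates, best first
ranking = sorted(val_accuracy.items(), key=lambda item: item[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

With 150 validation examples, the gap between the top two models corresponds to three predictions, so the ranking should be read with some caution.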

In [5]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

print("\n--- Final Evaluation on the Test Set ---")

# Re-initialize the best model (Multinomial Naive Bayes)
best_model = MultinomialNB()
best_model.fit(X_train_vec, y_train)

# Make predictions on the unseen test set
y_test_pred = best_model.predict(X_test_vec)

# Evaluate the model's performance on the test set
test_accuracy = accuracy_score(y_test, y_test_pred)
print("Test Set Accuracy:", test_accuracy)
print("\nTest Set Classification Report:\n", classification_report(y_test, y_test_pred))
print("\nTest Set Confusion Matrix:\n", confusion_matrix(y_test, y_test_pred))


--- Final Evaluation on the Test Set ---
Test Set Accuracy: 0.78

Test Set Classification Report:
               precision    recall  f1-score   support

           0       0.73      0.88      0.80        75
           1       0.85      0.68      0.76        75

    accuracy                           0.78       150
   macro avg       0.79      0.78      0.78       150
weighted avg       0.79      0.78      0.78       150


Test Set Confusion Matrix:
 [[66  9]
 [24 51]]
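
The figures in the classification report can be recovered directly from the confusion matrix, which helps when reading the two outputs together. The sketch below is an illustration added here, not part of the original notebook; it uses scikit-learn's convention that rows are true labels and columns are predicted labels, and the helper name `precision_recall` is hypothetical.

```python
# Confusion matrix from the output above: rows = true labels, columns = predictions
cm = [[66, 9],
      [24, 51]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total  # correct predictions lie on the diagonal

def precision_recall(cm, label):
    """Return (precision, recall) for one class of a square confusion matrix."""
    true_positive = cm[label][label]
    predicted = sum(row[label] for row in cm)  # column sum: everything predicted as `label`
    actual = sum(cm[label])                    # row sum: everything truly `label`
    return true_positive / predicted, true_positive / actual

for label in (0, 1):
    p, r = precision_recall(cm, label)
    print(f"class {label}: precision={p:.2f} recall={r:.2f}")
print(f"accuracy={accuracy:.2f}")
```

The 24 class-1 examples predicted as class 0 explain the lower recall for class 1 (51 of 75) compared with class 0 (66 of 75).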


#### Section 4: Save Final Model and Vectorizer

This final cell saves the trained model to disk with `joblib.dump`, so it can be loaded later to make predictions on new data without retraining. Only the model is written here; the fitted vectorizer is also needed at prediction time and remains available in `prepared_data.pkl`.

In [6]:
import joblib

# Define the file path for the saved model
model_file_path = '../models/sentiment_model.pkl'

# Save the best trained model to a file
joblib.dump(best_model, model_file_path)
print(f"\nSelected model saved to {model_file_path}")


Selected model saved to ../models/sentiment_model.pkl
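
A saved model is only usable together with the vectorizer that produced its features. The sketch below shows one way a model and vectorizer could be saved and reloaded as a pair. It is an illustration under stated assumptions: the toy corpus, the temporary directory, and the file name `tfidf_vectorizer.pkl` are hypothetical and do not come from the project.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy corpus for illustration only; not the project's data
texts = ["great movie", "loved it", "terrible plot", "hated it"]
labels = [1, 1, 0, 0]

toy_vectorizer = TfidfVectorizer()
toy_model = MultinomialNB().fit(toy_vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical vectorizer file name; the notebook only defines sentiment_model.pkl
    model_path = os.path.join(tmp_dir, "sentiment_model.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(toy_model, model_path)
    joblib.dump(toy_vectorizer, vectorizer_path)

    # Reload both artifacts, as a separate prediction script would
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

# New text must go through the same fitted vectorizer before prediction
new_texts = ["loved this great movie"]
prediction = loaded_model.predict(loaded_vectorizer.transform(new_texts))
print(prediction)
```

In the notebook itself, the equivalent step would be a second `joblib.dump` call on the `vectorizer` variable loaded in Section 1, with a path of the author's choosing.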
