# 3. Model Results Inspection

**Goal:** Load and inspect the outputs from the ML models that were trained and run by the main `src/` pipeline.

In [1]:
import pandas as pd
from pathlib import Path
from IPython.display import display  # explicit import so the cells also run outside a notebook kernel

# Define output paths
TOPICS_SUMMARY_PATH = Path("../outputs/topics_summary.csv")
DOCS_WITH_SENTIMENT_PATH = Path("../outputs/docs_with_sentiment.csv")

# Ensure the models directory exists (no-op if it is already there)
Path("../models").mkdir(parents=True, exist_ok=True)

### 1. Inspect Topic Modeling Results

The `src/topic_modeling.py` script already ran SBERT, UMAP, and HDBSCAN. We will load its final summary file.

In [2]:
if TOPICS_SUMMARY_PATH.exists():
    print(f"Loading topic summary from: {TOPICS_SUMMARY_PATH}")
    topics_df = pd.read_csv(TOPICS_SUMMARY_PATH)
    display(topics_df.head())
else:
    print(f"ERROR: {TOPICS_SUMMARY_PATH} not found. Please run the main pipeline first.")

Loading topic summary from: ../outputs/topics_summary.csv


|   | topic_id | n_mentions | top_terms | examples |
|---|---|---|---|---|
| 0 | 0 | 21417 | ['product', 'slickers', 'printed', 'windows', ... | ['. ', '. ', '. ', '. ', '. '] |
| 1 | -1 | 3376 | ['br', 'coffee', 'great', 'good', 'like', 'pro... | ["WOW! That's some good espresso. I use this ... |
| 2 | 6 | 520 | ['tea', 'br', 'great', 'like', 'good', 'green'... | ["powder instead of leaf. First, let me state ... |
| 3 | 36 | 401 | ['br', 'orange', 'juice', 'drink', 'soda', 'br... | ['Flavorful, and good!. This small can of juic... |
| 4 | 45 | 370 | ['chocolate', 'hot', 'cocoa', 'br', 'hot choco... | ['Great cocoa !!. Super good price and the bes... |
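
The `top_terms` and `examples` columns show up above as list literals, which suggests they were written to the CSV as strings. Below is a minimal sketch of parsing them back into Python lists and computing each topic's share of mentions, assuming the cells hold valid Python literals. The helper `parse_list_column` and the small `demo_df` stand-in for `topics_df` are hypothetical and not part of the pipeline. HDBSCAN assigns the label -1 to points it treats as noise, so that row is flagged separately.

```python
import ast

import pandas as pd


def parse_list_column(series: pd.Series) -> pd.Series:
    """Convert stringified Python lists (e.g. "['tea', 'br']") back into lists.

    Cells that cannot be parsed as a list become an empty list.
    """
    def _parse(value):
        if isinstance(value, list):
            return value
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError, TypeError):
            return []
        return parsed if isinstance(parsed, list) else []

    return series.apply(_parse)


# Hypothetical stand-in for topics_df (values are illustrative, not pipeline output)
demo_df = pd.DataFrame({
    "topic_id": [0, -1, 6],
    "n_mentions": [60, 30, 10],
    "top_terms": ["['product', 'windows']", "['br', 'coffee']", "['tea', 'green']"],
})

demo_df["top_terms"] = parse_list_column(demo_df["top_terms"])
# Share of all mentions that falls into each topic
demo_df["share"] = demo_df["n_mentions"] / demo_df["n_mentions"].sum()
# HDBSCAN uses -1 for documents it could not assign to any cluster
demo_df["is_noise"] = demo_df["topic_id"] == -1
print(demo_df)
```

The same calls should apply to `topics_df` directly, provided its columns match the ones displayed above.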


### 2. Inspect Sentiment Analysis Results

The `src/sentiment_models.py` script already ran VADER over all documents. We'll load the results.

In [3]:
if DOCS_WITH_SENTIMENT_PATH.exists():
    print(f"Loading sentiment results from: {DOCS_WITH_SENTIMENT_PATH}")
    sentiment_df = pd.read_csv(DOCS_WITH_SENTIMENT_PATH, low_memory=False)
    
    print("Sample of sentiment results:")
    display(sentiment_df[['text_clean', 'topic_id', 'vader_compound', 'sent_label']].head())
    
    print("\nOverall Sentiment Distribution:")
    display(sentiment_df['sent_label'].value_counts(normalize=True))
else:
    print(f"ERROR: {DOCS_WITH_SENTIMENT_PATH} not found. Please run the main pipeline first.")

Loading sentiment results from: ../outputs/docs_with_sentiment.csv
Sample of sentiment results:


|   | text_clean | topic_id | vader_compound | sent_label |
|---|---|---|---|---|
| 0 | Good Quality Dog Food. I have bought several o... | 10 | 0.9583 | pos |
| 1 | Not as Advertised. Product arrived labeled as ... | -1 | -0.5664 | neg |
| 2 | "Delight" says it all. This is a confection th... | -1 | 0.9066 | pos |
| 3 | Cough Medicine. If you are looking for the sec... | 6 | 0.4404 | pos |
| 4 | Great taffy. Great taffy at a great price. Th... | -1 | 0.9661 | pos |



Overall Sentiment Distribution:


sent_label
neu    0.686370
pos    0.283009
neg    0.030621
Name: proportion, dtype: float64
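
The overall distribution can hide differences between topics. Below is a minimal sketch of a per-topic breakdown using `pd.crosstab`, assuming the `topic_id` and `sent_label` columns shown above. The `demo` frame is a hypothetical stand-in for `sentiment_df`, with made-up rows.

```python
import pandas as pd

# Hypothetical stand-in for sentiment_df (illustrative rows only)
demo = pd.DataFrame({
    "topic_id": [10, -1, -1, 6, -1, 6],
    "sent_label": ["pos", "neg", "pos", "pos", "pos", "neu"],
})

# Row-normalised shares: each topic's row sums to 1
by_topic = pd.crosstab(demo["topic_id"], demo["sent_label"], normalize="index")
# Fix the column order and fill in labels that never occur in the sample
by_topic = by_topic.reindex(columns=["neg", "neu", "pos"], fill_value=0.0)
print(by_topic.round(3))
```

Swapping `demo` for `sentiment_df` should give one row per topic, which makes it easier to spot topics whose sentiment mix departs from the overall proportions.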

### 3. Benchmarking (SKLearn Models)

No benchmark is run in this notebook. The approach is described conceptually here, and the comparison itself belongs in `04_evaluation.ipynb`.

In our main pipeline, we made a deliberate trade-off between speed and accuracy:
1.  **VADER:** Used in the pipeline because it is rule-based and very fast, which is necessary for processing 30,000+ documents quickly.
2.  **`bert_sentiment` model:** A more powerful, but much slower, Transformer model. It is kept in the `models/` folder and serves as our 'gold standard' benchmark and a future upgrade path.

Our `04_evaluation.ipynb` notebook is where we would formally compare their outputs on a sample of data.
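
One way such a comparison could be set up is sketched below under two assumptions: both models emit labels from the same `neg`/`neu`/`pos` set, and the BERT predictions sit in a column named `bert_label`. That column name is hypothetical, as the source does not specify one. The function `compare_labels` is illustrative only and is not taken from `04_evaluation.ipynb`.

```python
import pandas as pd


def compare_labels(df, col_a="sent_label", col_b="bert_label"):
    """Return (agreement rate, confusion table) for two label columns."""
    # Fraction of rows where both models assign the same label
    agreement = float((df[col_a] == df[col_b]).mean())
    # Rows: labels from col_a (VADER); columns: labels from col_b (BERT)
    confusion = pd.crosstab(
        df[col_a], df[col_b], rownames=["vader"], colnames=["bert"]
    )
    return agreement, confusion


# Hypothetical sample with made-up labels
sample = pd.DataFrame({
    "sent_label": ["pos", "neg", "neu", "neu", "pos"],
    "bert_label": ["pos", "neg", "pos", "neu", "neg"],
})
agreement, confusion = compare_labels(sample)
print(f"Agreement: {agreement:.2f}")
print(confusion)
```

The off-diagonal cells of the confusion table show where the two models disagree, which is usually more informative than the single agreement figure.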

### 4. Model Location Reference

The models used by the pipeline (SBERT, UMAP, HDBSCAN, VADER) are loaded dynamically *inside* the `src/` scripts.

The `models/bert_sentiment` folder holds a saved, fine-tuned model that we use for benchmarking and plan to adopt as a 'Future Development' upgrade for higher accuracy.