
# Hiver – Email Tagging Mini-System (Baseline)

This notebook implements a simple, end-to-end baseline for the **Hiver AI Intern Evaluation Assignment – Part A (Email Tagging Mini-System)**.

It includes:

- Data loading & basic exploration  
- A **pattern-based text classifier**  
- **Customer-specific tag isolation** during prediction  
- Simple **evaluation & error analysis**  
- Notes on **patterns, anti-patterns, and future improvements**


In [1]:

# Core imports
import pandas as pd
import numpy as np

from collections import defaultdict, Counter
import re

# For evaluation
from sklearn.metrics import classification_report, confusion_matrix



## 1. Load Dataset

Expected CSV format (`emails.csv`):

- `subject` (string)  
- `body` (string)  
- `customer_id` (string or int)  
- `tag` (ground truth label)


In [2]:

# Adjust the path if your CSV is elsewhere
DATA_PATH = "/kaggle/input/emails/emails.csv"

df = pd.read_csv(DATA_PATH)

print("Shape:", df.shape)
df.head()


Shape: (12, 5)


Unnamed: 0,email_id,customer_id,subject,body,tag
0,1,CUST_A,Unable to access shared mailbox,"Hi team, I'm unable to access the shared mailb...",access_issue
1,2,CUST_A,Rules not working,We created a rule to auto-assign emails based ...,workflow_issue
2,3,CUST_A,Email stuck in pending,One of our emails is stuck in pending even aft...,status_bug
3,4,CUST_B,Automation creating duplicate tasks,Your automation engine is creating 2 tasks for...,automation_bug
4,5,CUST_B,Tags missing,Many of our tags are not appearing for new ema...,tagging_issue



## 2. Basic Data Sanity Checks


In [3]:

print("Columns:", df.columns.tolist())
print("\nMissing values per column:")
print(df.isna().sum())

print("\nUnique customers:", df['customer_id'].nunique())
print("Unique tags overall:", df['tag'].nunique())

print("\nTags per customer:")
print(df.groupby("customer_id")["tag"].unique())


Columns: ['email_id', 'customer_id', 'subject', 'body', 'tag']

Missing values per column:
email_id       0
customer_id    0
subject        0
body           0
tag            0
dtype: int64

Unique customers: 4
Unique tags overall: 12

Tags per customer:
customer_id
CUST_A           [access_issue, workflow_issue, status_bug]
CUST_B             [automation_bug, tagging_issue, billing]
CUST_C           [analytics_issue, performance, setup_help]
CUST_D    [mail_merge_issue, user_management, feature_re...
Name: tag, dtype: object



## 3. Customer-Specific Tag Sets

We build a mapping `customer_id -> array of tags`.  
This gives us **tag isolation**: when predicting for a customer, only that customer's own tags are candidates.


In [4]:

customer_tags = df.groupby("customer_id")["tag"].unique().to_dict()
customer_tags


{'CUST_A': array(['access_issue', 'workflow_issue', 'status_bug'], dtype=object),
 'CUST_B': array(['automation_bug', 'tagging_issue', 'billing'], dtype=object),
 'CUST_C': array(['analytics_issue', 'performance', 'setup_help'], dtype=object),
 'CUST_D': array(['mail_merge_issue', 'user_management', 'feature_request'],
       dtype=object)}


## 4. Pattern-Based Baseline Classifier

We build a simple rule-based classifier:

- Use **keyword patterns** for each tag (global mapping)
- During prediction, **restrict** to only the allowed tags for that `customer_id`
- Remove **anti-pattern words** that are common but low-signal before scoring


In [5]:

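# Example keyword dictionary.
# Keys are tag names, values are lists of indicative keywords/phrases.
#
# NOTE: these keys are generic placeholders. Of the tags in this dataset only
# "billing" has an entry here, so every other tag keeps a score of 0 and can
# only be predicted through the fallback. Extend the dictionary with the
# actual tags present in your dataset.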
KEYWORDS = {
    "refund": ["refund", "money back", "chargeback", "reversal", "credited"],
    "billing": ["invoice", "billing", "charged", "payment", "amount due"],
    "technical": ["error", "bug", "not working", "crash", "issue logging in"],
    "password": ["reset password", "password", "login", "authentication"],
    "general_query": ["question", "clarify", "information", "details"],
}

# Words that often appear but are not informative about the true tag.
ANTI_PATTERNS = ["urgent", "request", "asap", "issue", "thanks", "regards"]


def clean_text(text: str) -> str:
    if not isinstance(text, str):
        text = "" if pd.isna(text) else str(text)
    text = text.lower()
    
    # Remove anti-patterns for scoring, matching whole words only so that
    # e.g. "request" is not cut out of "requested"
    for w in ANTI_PATTERNS:
        text = re.sub(rf"\b{re.escape(w)}\b", " ", text)
    
    # Normalise spaces
    text = re.sub(r"\s+", " ", text).strip()
    return text

def pattern_classifier(subject: str, body: str, allowed_tags) -> str:
    """Simple pattern-based classifier with customer tag isolation.

    Parameters
    ----------
    subject : str
    body : str
    allowed_tags : iterable of tags (only from this customer)

    Returns
    -------
    str
        Predicted tag (one of allowed_tags).
    """
    # Clean and combine
    subject_clean = clean_text(subject)
    body_clean = clean_text(body)
    text = subject_clean + " " + body_clean
    
    # Initialize scores for each allowed tag
    scores = {tag: 0.0 for tag in allowed_tags}
    
    for tag in allowed_tags:
        # If we don't have explicit keywords for this tag, leave its score as 0
        if tag not in KEYWORDS:
            continue
        
        for kw in KEYWORDS[tag]:
            kw_l = kw.lower()
            # Higher weight if keyword appears in subject
            if kw_l in subject_clean:
                scores[tag] += 2.0
            # Lower weight if keyword appears in body
            if kw_l in body_clean:
                scores[tag] += 1.0
    
    # If all scores are 0 (no match), fall back to the most frequent tag
    # among the allowed tags
    if all(v == 0 for v in scores.values()):
        # Counts come from the global `df`, filtered by tag rather than by
        # customer_id. With one email per tag every count ties at 1, so the
        # fallback is simply the first tag that value_counts() lists.
        subset = df[df["tag"].isin(allowed_tags)]
        fallback = subset["tag"].value_counts().idxmax()
        return fallback
    
    # Return the tag with the highest score
    return max(scores, key=scores.get)
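
The example `KEYWORDS` above targets a generic tag set, so most tags in this dataset never receive a score. A minimal sketch of the same subject/body weighting keyed by the dataset's own tags follows. The names `KEYWORDS_BY_TAG` and `score_tags` are hypothetical, and the keyword lists are guesses from the subjects shown earlier, not validated rules.

```python
# Hypothetical keyword lists, guessed from the email subjects shown earlier.
# They are assumptions for illustration, not tuned or validated rules.
KEYWORDS_BY_TAG = {
    "access_issue": ["unable to access", "permission", "access"],
    "workflow_issue": ["rule", "auto-assign", "workflow"],
    "status_bug": ["stuck", "pending", "resolved"],
    "automation_bug": ["automation", "duplicate"],
    "tagging_issue": ["tags"],
    "billing": ["invoice", "billing", "charged", "payment"],
}


def score_tags(subject, body, allowed_tags, keywords=KEYWORDS_BY_TAG):
    """Score each allowed tag: 2 points per subject hit, 1 per body hit."""
    subject_l, body_l = subject.lower(), body.lower()
    scores = {}
    for tag in allowed_tags:
        score = 0.0
        for kw in keywords.get(tag, []):
            score += 2.0 * (kw in subject_l) + 1.0 * (kw in body_l)
        scores[tag] = score
    return scores


scores = score_tags(
    "Rules not working",
    "We created a rule to auto-assign emails",
    ["access_issue", "workflow_issue", "status_bug"],
)
best = max(scores, key=scores.get)
print(best, scores)
```

Any real keyword list would need to be checked against held-out emails, since lists written while looking at the evaluation data will overstate accuracy.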



## 5. Run Predictions (With Customer Isolation)

For each email, we:
- Look up `allowed_tags = customer_tags[customer_id]`
- Run `pattern_classifier` with only those tags


In [6]:

preds = []

for _, row in df.iterrows():
    cid = row["customer_id"]
    allowed = customer_tags[cid]
    pred = pattern_classifier(row.get("subject", ""), row.get("body", ""), allowed)
    preds.append(pred)

df["predicted_tag"] = preds

df[["subject", "body", "customer_id", "tag", "predicted_tag"]].head()


Unnamed: 0,subject,body,customer_id,tag,predicted_tag
0,Unable to access shared mailbox,"Hi team, I'm unable to access the shared mailb...",CUST_A,access_issue,access_issue
1,Rules not working,We created a rule to auto-assign emails based ...,CUST_A,workflow_issue,access_issue
2,Email stuck in pending,One of our emails is stuck in pending even aft...,CUST_A,status_bug,access_issue
3,Automation creating duplicate tasks,Your automation engine is creating 2 tasks for...,CUST_B,automation_bug,automation_bug
4,Tags missing,Many of our tags are not appearing for new ema...,CUST_B,tagging_issue,automation_bug



## 6. Evaluation


In [7]:

accuracy = (df["tag"] == df["predicted_tag"]).mean()
print(f"Accuracy: {accuracy:.4f}")

print("\nClassification Report:")
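# zero_division=0 reports 0.00 for tags that are never predicted without raising warnings
print(classification_report(df["tag"], df["predicted_tag"], zero_division=0))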


Accuracy: 0.4167

Classification Report:
                  precision    recall  f1-score   support

    access_issue       0.33      1.00      0.50         1
 analytics_issue       0.33      1.00      0.50         1
  automation_bug       0.50      1.00      0.67         1
         billing       1.00      1.00      1.00         1
 feature_request       0.00      0.00      0.00         1
mail_merge_issue       0.33      1.00      0.50         1
     performance       0.00      0.00      0.00         1
      setup_help       0.00      0.00      0.00         1
      status_bug       0.00      0.00      0.00         1
   tagging_issue       0.00      0.00      0.00         1
 user_management       0.00      0.00      0.00         1
  workflow_issue       0.00      0.00      0.00         1

        accuracy                           0.42        12
       macro avg       0.21      0.42      0.26        12
    weighted avg       0.21      0.42      0.26        12
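
The macro averages in the report are unweighted means over the 12 tags. Because every tag has support 1, the weighted averages coincide with them. A quick arithmetic check, reading the rounded 0.33 and 0.50 in the report as 1/3 and 1/2 (an assumption about the unrounded values):

```python
from fractions import Fraction

# Per-tag precision and recall taken from the report above.
# Tags that are never predicted count as precision 0.
precision = [Fraction(1, 3)] * 3 + [Fraction(1, 2), Fraction(1)] + [Fraction(0)] * 7
recall = [Fraction(1)] * 5 + [Fraction(0)] * 7

macro_precision = sum(precision) / len(precision)
macro_recall = sum(recall) / len(recall)

print(f"macro precision = {float(macro_precision):.2f}")
print(f"macro recall    = {float(macro_recall):.2f}")
```

With one email per tag, macro recall is the same number as accuracy (5 correct out of 12).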





In [8]:

# Confusion matrix (optional)
labels = sorted(df["tag"].unique())
cm = confusion_matrix(df["tag"], df["predicted_tag"], labels=labels)

cm_df = pd.DataFrame(cm, index=[f"true_{l}" for l in labels],
                        columns=[f"pred_{l}" for l in labels])
cm_df


Unnamed: 0,pred_access_issue,pred_analytics_issue,pred_automation_bug,pred_billing,pred_feature_request,pred_mail_merge_issue,pred_performance,pred_setup_help,pred_status_bug,pred_tagging_issue,pred_user_management,pred_workflow_issue
true_access_issue,1,0,0,0,0,0,0,0,0,0,0,0
true_analytics_issue,0,1,0,0,0,0,0,0,0,0,0,0
true_automation_bug,0,0,1,0,0,0,0,0,0,0,0,0
true_billing,0,0,0,1,0,0,0,0,0,0,0,0
true_feature_request,0,0,0,0,0,1,0,0,0,0,0,0
true_mail_merge_issue,0,0,0,0,0,1,0,0,0,0,0,0
true_performance,0,1,0,0,0,0,0,0,0,0,0,0
true_setup_help,0,1,0,0,0,0,0,0,0,0,0,0
true_status_bug,1,0,0,0,0,0,0,0,0,0,0,0
true_tagging_issue,0,0,1,0,0,0,0,0,0,0,0,0



## 7. Error Analysis

We now look at **misclassified examples** to detect:

- Patterns where rules work well
- Anti-patterns / trap words that confuse the classifier
- Possible improvements for the rules / feature engineering


In [9]:

errors = df[df["tag"] != df["predicted_tag"]].copy()
print("Number of errors:", len(errors))

# Show a few random errors
errors.sample(min(10, len(errors)), random_state=42) if len(errors) > 0 else "No errors!"


Number of errors: 7


Unnamed: 0,email_id,customer_id,subject,body,tag,predicted_tag
1,2,CUST_A,Rules not working,We created a rule to auto-assign emails based ...,workflow_issue,access_issue
2,3,CUST_A,Email stuck in pending,One of our emails is stuck in pending even aft...,status_bug,access_issue
10,11,CUST_D,Can't add new user,Trying to add a new team member but getting an...,user_management,mail_merge_issue
4,5,CUST_B,Tags missing,Many of our tags are not appearing for new ema...,tagging_issue,automation_bug
8,9,CUST_C,Need help setting up SLAs,We want to configure SLAs for different custom...,setup_help,analytics_issue
7,8,CUST_C,Delay in email loading,Opening a conversation takes 8–10 seconds. Thi...,performance,analytics_issue
11,12,CUST_D,Feature request: Dark mode,Dark mode would help during late-night support...,feature_request,mail_merge_issue


In [10]:

# (Optional) Inspect errors per tag
if len(errors) > 0:
    print("Errors per true tag:")
    print(errors["tag"].value_counts())
    
    print("\nErrors per predicted tag:")
    print(errors["predicted_tag"].value_counts())


Errors per true tag:
tag
workflow_issue     1
status_bug         1
tagging_issue      1
performance        1
setup_help         1
user_management    1
feature_request    1
Name: count, dtype: int64

Errors per predicted tag:
predicted_tag
access_issue        2
analytics_issue     2
mail_merge_issue    2
automation_bug      1
Name: count, dtype: int64
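
Every wrongly predicted tag above is the first tag listed for its customer, which is what one would expect when an email matches no keyword and the fallback decides. A sketch of a fallback computed per customer follows, assuming a hypothetical `history` of `(customer_id, tag)` pairs kept separate from the emails being evaluated. `build_fallbacks` is an illustrative name, not part of the notebook.

```python
from collections import Counter, defaultdict


def build_fallbacks(history):
    """Map each customer_id to its most frequent tag in `history`.

    `history` is a hypothetical iterable of (customer_id, tag) pairs.
    Ties go to the tag seen first, because Counter.most_common keeps
    first-encountered order for equal counts.
    """
    counts = defaultdict(Counter)
    for customer_id, tag in history:
        counts[customer_id][tag] += 1
    return {cid: c.most_common(1)[0][0] for cid, c in counts.items()}


history = [
    ("CUST_A", "workflow_issue"),
    ("CUST_A", "access_issue"),
    ("CUST_A", "workflow_issue"),
    ("CUST_B", "billing"),
]
fallbacks = build_fallbacks(history)
print(fallbacks)
```

Keeping the history separate from the evaluation set avoids the fallback reading labels from the very emails it is scored on.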



## 8. (Conceptual) LLM Prompt-Based Classifier

If using an LLM as a classifier, the core idea is:

1. For a given `customer_id`, get their tag set:  
   `allowed_tags = customer_tags[customer_id]`
2. Build a prompt that **explicitly restricts** the model to only use those tags.
3. Ask the model to return exactly one tag from that list.

**Prompt template (example):**

```text
You are an email tagging assistant for customer support.

EMAIL:
Subject: {subject}
Body: {body}

Allowed tags for this customer: {tag_list}

Classify the email into ONE tag from the allowed list.
Return only the tag text, nothing else.
```

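Restricting the prompt to `allowed_tags` keeps other customers' tags out of the model's context. The model can still return text outside the list, so the response should be checked against `allowed_tags` before it is accepted; only that check fully enforces **customer isolation** at inference time.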
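
The three steps can be sketched without committing to a particular model: fill the template, call the model, then validate the answer. The helper names `build_tag_prompt` and `validate_tag` are hypothetical, and the model call itself is left out.

```python
PROMPT_TEMPLATE = """You are an email tagging assistant for customer support.

EMAIL:
Subject: {subject}
Body: {body}

Allowed tags for this customer: {tag_list}

Classify the email into ONE tag from the allowed list.
Return only the tag text, nothing else."""


def build_tag_prompt(subject, body, allowed_tags):
    """Fill the template with one email and that customer's tags."""
    return PROMPT_TEMPLATE.format(
        subject=subject, body=body, tag_list=", ".join(allowed_tags)
    )


def validate_tag(raw_output, allowed_tags, fallback=None):
    """Return the model's answer only if it is one of the allowed tags."""
    answer = raw_output.strip().strip("`\"'").lower()
    return answer if answer in set(allowed_tags) else fallback


allowed = ["automation_bug", "tagging_issue", "billing"]
prompt = build_tag_prompt("Tags missing", "Many of our tags are not appearing", allowed)
print(validate_tag(" billing\n", allowed))
print(validate_tag("access_issue", allowed))
```

Returning a `fallback` (here `None`) for out-of-list answers makes a leak of another customer's tag visible instead of silently storing it.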



## 9. Notes: Patterns, Anti-Patterns & Future Improvements

### Patterns (what helps accuracy)
- Strong lexical matches: words like **"refund"**, **"invoice"**, **"password"**  
- Phrase-level matches: **"not working"**, **"money back"**  
- Giving higher weight to the **subject line**, as it often summarises the intent

### Anti-Patterns (trap words)
These appear often but are not helpful for classification:

- Generic words: `"issue"`, `"request"`  
- Politeness: `"thanks"`, `"regards"`  
- Escalation: `"urgent"`, `"asap"`  

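Guardrail: remove them before scoring, as `clean_text` does with `ANTI_PATTERNS`. Note that `"issue"` and `"request"` also occur in tag names such as `access_issue` and `feature_request`, so stripping them can discard useful signal (for example in the subject "Feature request: Dark mode").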
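
How the words are removed matters. A small comparison of plain substring removal against whole-word removal, using the same `ANTI_PATTERNS` list (the function names and the sample sentence are made up for illustration):

```python
import re

ANTI_PATTERNS = ["urgent", "request", "asap", "issue", "thanks", "regards"]


def strip_substrings(text):
    """Plain substring removal: also eats parts of longer words."""
    for w in ANTI_PATTERNS:
        text = text.replace(w, " ")
    return re.sub(r"\s+", " ", text).strip()


# One compiled pattern that matches the anti-pattern words only as whole words.
WHOLE_WORD = re.compile(r"\b(?:" + "|".join(map(re.escape, ANTI_PATTERNS)) + r")\b")


def strip_whole_words(text):
    """Whole-word removal: leaves words such as 'requested' intact."""
    return re.sub(r"\s+", " ", WHOLE_WORD.sub(" ", text)).strip()


sample = "we requested a reissued invoice, thanks"
print(strip_substrings(sample))
print(strip_whole_words(sample))
```

Whole-word matching avoids mangling longer words, but it still drops a standalone "request", so it does not by itself solve the `feature_request` case.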

---

### 3 Major Ideas for Productionising

1. **Hybrid Model (Rules + ML Encoder)**  
   - Use a transformer-based encoder (e.g., a small sentence-embedding model) to get embeddings.  
   - Combine embedding similarity with pattern scores for robust predictions.

2. **Per-Customer Few-Shot LLM / Embedding Adaptation**  
   - Store a few representative emails per tag per customer.  
   - At inference, compare the new email to these prototypes to choose the tag.

3. **Confidence & Human-in-the-Loop Guardrails**  
   - If the top tag score / probability is low, flag for human review.  
   - Continually log and learn from corrected tags to update patterns and models.
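
As one possible shape for the third idea, the sketch below flags a prediction for review when the top score is low or too close to the runner-up. `route_prediction` and both thresholds are hypothetical placeholders that would need tuning on real data, and the function assumes a non-empty score dict.

```python
def route_prediction(scores, min_score=2.0, min_margin=1.0):
    """Return (tag, needs_review) from a non-empty dict of tag -> score.

    The thresholds are placeholders, not recommended values.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    top_tag, top_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    needs_review = top_score < min_score or (top_score - runner_up) < min_margin
    return top_tag, needs_review


confident = route_prediction({"billing": 3.0, "tagging_issue": 0.0})
uncertain = route_prediction({"billing": 1.0, "tagging_issue": 1.0})
print(confident, uncertain)
```

With the pattern classifier in this notebook, an all-zero score dict would always be routed to review instead of silently receiving the fallback tag.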


In [11]:
!pip install -q transformers accelerate sentencepiece

In [14]:
!pip install -q bitsandbytes


In [15]:
import pandas as pd

# Load your existing file
df = pd.read_csv("/kaggle/input/emails/emails.csv")

# Sentiment analysis only needs subject + body; use the first 10 emails
sent_df = df[["email_id", "subject", "body"]].head(10).copy()
sent_df


Unnamed: 0,email_id,subject,body
0,1,Unable to access shared mailbox,"Hi team, I'm unable to access the shared mailb..."
1,2,Rules not working,We created a rule to auto-assign emails based ...
2,3,Email stuck in pending,One of our emails is stuck in pending even aft...
3,4,Automation creating duplicate tasks,Your automation engine is creating 2 tasks for...
4,5,Tags missing,Many of our tags are not appearing for new ema...
5,6,Billing query,We were charged incorrectly this month. Need a...
6,7,CSAT not visible,CSAT scores disappeared from our dashboard tod...
7,8,Delay in email loading,Opening a conversation takes 8–10 seconds. Thi...
8,9,Need help setting up SLAs,We want to configure SLAs for different custom...
9,10,Mail merge failing,Mail merge is not sending emails even though t...


In [20]:
PROMPT_V1 = """
You are a sentiment analysis engine for customer support emails.

Your task:
- Read the email (subject + body).
- Decide the sentiment from the customer's point of view.
- Output JSON:
  - sentiment: one of ["positive", "negative", "neutral"]
  - confidence: a float between 0 and 1
  - reasoning: a short explanation of why you chose this sentiment (for internal debugging only, not shown to the user).

Guidelines:
- "Negative" if the email expresses a problem, bug, outage, billing issue, frustration, or dissatisfaction, even if written politely.
- "Positive" if the email clearly expresses happiness, satisfaction, praise, or excitement about the product or service.
- "Neutral" if the email is mostly informational or a help request without clear frustration or praise.
- Ignore polite phrases like "please", "thanks", "kind regards" when deciding sentiment.
- Focus on the emotional tone related to the product or service.

Return ONLY a valid JSON object with exactly these keys:
- sentiment
- confidence
- reasoning

Example:
{{
  "sentiment": "negative",
  "confidence": 0.92,
  "reasoning": "Customer reports a feature that stopped working and expresses concern."
}}

Now analyze this email:

Subject: {subject}
Body: {body}
"""


In [17]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.3"  # free, open-source instruct model

tokenizer = AutoTokenizer.from_pretrained(model_name)

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)



In [18]:
import json

def run_model(prompt: str, max_new_tokens: int = 256) -> str:
    """
    Send the prompt to the model and return only the newly generated text.
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output_ids = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,  # greedy decoding is deterministic; temperature does not apply
            pad_token_id=tokenizer.eos_token_id,  # set explicitly to avoid the pad_token_id notice
        )

    # generate() returns the prompt tokens followed by the new tokens;
    # decode only the new ones instead of string-matching the echoed prompt
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_len:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True).strip()


def parse_sentiment_output(raw_text: str):
    """
    Try to parse a JSON object from the model output.
    If parsing fails, return a fallback object that includes the raw text in 'reasoning'.
    """
    raw_text = raw_text.strip()

    # Try to isolate JSON block if there is extra text
    start = raw_text.find("{")
    end = raw_text.rfind("}")
    if start != -1 and end != -1 and end > start:
        raw_text = raw_text[start:end+1]

    try:
        obj = json.loads(raw_text)
    except json.JSONDecodeError:
        obj = {
            "sentiment": "error",
            "confidence": 0.0,
            "reasoning": raw_text
        }

    return obj
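
Successful JSON parsing does not guarantee that the fields hold usable values. A sketch of a follow-up check that restricts `sentiment` to the three labels in the prompt and clamps `confidence` to [0, 1]. The name `normalise_sentiment` is hypothetical and the function is not part of the pipeline above.

```python
ALLOWED_SENTIMENTS = {"positive", "negative", "neutral"}


def normalise_sentiment(obj):
    """Coerce a parsed model response into a predictable shape."""
    sentiment = str(obj.get("sentiment", "")).strip().lower()
    if sentiment not in ALLOWED_SENTIMENTS:
        sentiment = "error"
    try:
        confidence = float(obj.get("confidence", 0.0))
    except (TypeError, ValueError):
        confidence = 0.0
    confidence = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    return {
        "sentiment": sentiment,
        "confidence": confidence,
        "reasoning": str(obj.get("reasoning", "")),
    }


cleaned = normalise_sentiment({"sentiment": "Negative", "confidence": "1.2"})
print(cleaned)
```

It could be applied to the result of `parse_sentiment_output` before the row is appended to the results list.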


In [21]:
results_v1 = []

for _, row in sent_df.iterrows():
    prompt = PROMPT_V1.format(subject=row["subject"], body=row["body"])
    raw_output = run_model(prompt)
    parsed = parse_sentiment_output(raw_output)

    results_v1.append({
        "email_id": row["email_id"],
        "subject": row["subject"],
        "sentiment_v1": parsed.get("sentiment"),
        "confidence_v1": parsed.get("confidence"),
        "reasoning_v1": parsed.get("reasoning"),
    })

v1_df = pd.DataFrame(results_v1)
v1_df



Unnamed: 0,email_id,subject,sentiment_v1,confidence_v1,reasoning_v1
0,1,Unable to access shared mailbox,negative,0.95,Customer reports an issue with accessing a sha...
1,2,Rules not working,negative,0.95,Customer reports a problem with a feature and ...
2,3,Email stuck in pending,negative,0.95,Customer reports an issue with an email being ...
3,4,Automation creating duplicate tasks,negative,0.95,Customer reports a problem with the automation...
4,5,Tags missing,negative,0.95,Customer reports a problem with the tagging fe...
5,6,Billing query,negative,1.0,Customer states they were charged incorrectly ...
6,7,CSAT not visible,negative,0.95,Customer reports a missing feature and asks ab...
7,8,Delay in email loading,negative,0.99,Customer reports a problem with the email load...
8,9,Need help setting up SLAs,neutral,0.85,"Customer is asking for help, but does not expr..."
9,10,Mail merge failing,negative,0.98,Customer reports a problem with the mail merge...


In [22]:
v1_df.to_csv("sentiment_results_v1.csv", index=False)


In [23]:
PROMPT_V2 = """
You are a sentiment analysis engine for customer support emails.

Your task:
- Read the email (subject + body).
- Decide the sentiment from the customer's point of view.
- Output JSON:
  - sentiment: one of ["positive", "negative", "neutral"]
  - confidence: a float between 0 and 1
  - reasoning: a short explanation of why you chose this sentiment (for internal debugging only, not shown to the user).

IMPORTANT RULES:

1. Sentiment definitions:
   - "Negative":
     - The customer reports a product problem, bug, performance issue, incorrect billing, outage, data loss, or any unexpected behavior.
     - Assume billing errors, duplicate charges, or incorrect invoices are negative EVEN if written politely.
   - "Positive":
     - The customer expresses clear satisfaction, praise, or excitement (e.g. "love this", "works great", "very happy").
   - "Neutral":
     - The email is primarily a question, configuration request, or feature request without explicit dissatisfaction or praise.

2. Ignore politeness:
   - Do NOT treat words like "please", "thanks", "kindly", "regards", "sorry" as positive sentiment by themselves.

3. Confidence:
   - 0.80–1.00: sentiment is very clear (strong complaint, clear praise).
   - 0.50–0.79: sentiment is somewhat clear or implied.
   - 0.30–0.49: sentiment is ambiguous and could fit multiple labels.
   - Always output a float between 0 and 1.

4. Reasoning style:
   - 1–2 short sentences.
   - Mention the specific phrases that indicate the sentiment (e.g. "charged incorrectly", "not working", "happy with", etc.).

Return ONLY a valid JSON object with exactly these keys:
- sentiment
- confidence
- reasoning

Example:
{{
  "sentiment": "negative",
  "confidence": 0.90,
  "reasoning": "Customer says they were charged incorrectly, which is a billing problem."
}}

Now analyze this email:

Subject: {subject}
Body: {body}
"""


In [24]:
results_v2 = []

for _, row in sent_df.iterrows():
    prompt = PROMPT_V2.format(subject=row["subject"], body=row["body"])
    raw_output = run_model(prompt)
    parsed = parse_sentiment_output(raw_output)

    results_v2.append({
        "email_id": row["email_id"],
        "subject": row["subject"],
        "sentiment_v2": parsed.get("sentiment"),
        "confidence_v2": parsed.get("confidence"),
        "reasoning_v2": parsed.get("reasoning"),
    })

v2_df = pd.DataFrame(results_v2)
v2_df



Unnamed: 0,email_id,subject,sentiment_v2,confidence_v2,reasoning_v2
0,1,Unable to access shared mailbox,negative,0.9,Customer reports an issue with accessing the s...
1,2,Rules not working,negative,0.95,Customer reports a problem with the rule not w...
2,3,Email stuck in pending,negative,0.9,Customer reports an email problem.
3,4,Automation creating duplicate tasks,negative,0.95,Customer reports an issue with the automation ...
4,5,Tags missing,negative,0.95,Customer reports that many tags are not appear...
5,6,Billing query,negative,1.0,"Customer says they were charged incorrectly, w..."
6,7,CSAT not visible,neutral,0.8,Customer asks a question about missing CSAT sc...
7,8,Delay in email loading,negative,0.95,Customer mentions a performance issue that aff...
8,9,Need help setting up SLAs,neutral,0.8,The customer is asking a question about settin...
9,10,Mail merge failing,negative,0.95,Customer reports that mail merge is not sendin...


In [25]:
v2_df.to_csv("sentiment_results_v2.csv", index=False)


In [26]:
comparison_df = (
    v1_df
    .merge(v2_df, on=["email_id", "subject"], how="inner")
    .sort_values("email_id")
)

comparison_df


Unnamed: 0,email_id,subject,sentiment_v1,confidence_v1,reasoning_v1,sentiment_v2,confidence_v2,reasoning_v2
0,1,Unable to access shared mailbox,negative,0.95,Customer reports an issue with accessing a sha...,negative,0.9,Customer reports an issue with accessing the s...
1,2,Rules not working,negative,0.95,Customer reports a problem with a feature and ...,negative,0.95,Customer reports a problem with the rule not w...
2,3,Email stuck in pending,negative,0.95,Customer reports an issue with an email being ...,negative,0.9,Customer reports an email problem.
3,4,Automation creating duplicate tasks,negative,0.95,Customer reports a problem with the automation...,negative,0.95,Customer reports an issue with the automation ...
4,5,Tags missing,negative,0.95,Customer reports a problem with the tagging fe...,negative,0.95,Customer reports that many tags are not appear...
5,6,Billing query,negative,1.0,Customer states they were charged incorrectly ...,negative,1.0,"Customer says they were charged incorrectly, w..."
6,7,CSAT not visible,negative,0.95,Customer reports a missing feature and asks ab...,neutral,0.8,Customer asks a question about missing CSAT sc...
7,8,Delay in email loading,negative,0.99,Customer reports a problem with the email load...,negative,0.95,Customer mentions a performance issue that aff...
8,9,Need help setting up SLAs,neutral,0.85,"Customer is asking for help, but does not expr...",neutral,0.8,The customer is asking a question about settin...
9,10,Mail merge failing,negative,0.98,Customer reports a problem with the mail merge...,negative,0.95,Customer reports that mail merge is not sendin...
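
Across the ten emails the two prompts disagree on one label: email 7 ("CSAT not visible") moves from negative to neutral. A small sketch of picking out such rows, using three rows copied from the table above as plain dicts:

```python
# Three rows copied from the comparison table above.
rows = [
    {"email_id": 6, "sentiment_v1": "negative", "sentiment_v2": "negative"},
    {"email_id": 7, "sentiment_v1": "negative", "sentiment_v2": "neutral"},
    {"email_id": 9, "sentiment_v1": "neutral", "sentiment_v2": "neutral"},
]

# Keep the ids of emails whose label differs between the two prompt versions.
changed = [r["email_id"] for r in rows if r["sentiment_v1"] != r["sentiment_v2"]]
print("Emails whose label changed:", changed)
```

On the DataFrame itself the equivalent filter would be `comparison_df[comparison_df["sentiment_v1"] != comparison_df["sentiment_v2"]]`.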


In [27]:
for _, r in comparison_df.iterrows():
    print(f"Email {r.email_id} — {r.subject}")
    print("  V1 sentiment :", r.sentiment_v1, "| conf:", r.confidence_v1)
    print("  V2 sentiment :", r.sentiment_v2, "| conf:", r.confidence_v2)
    print("  V1 reasoning :", r.reasoning_v1)
    print("  V2 reasoning :", r.reasoning_v2)
    print("-" * 80)


Email 1 — Unable to access shared mailbox
  V1 sentiment : negative | conf: 0.95
  V2 sentiment : negative | conf: 0.9
  V1 reasoning : Customer reports an issue with accessing a shared mailbox and expresses concern.
  V2 reasoning : Customer reports an issue with accessing the shared mailbox, which is a performance issue.
--------------------------------------------------------------------------------
Email 2 — Rules not working
  V1 sentiment : negative | conf: 0.95
  V2 sentiment : negative | conf: 0.95
  V1 reasoning : Customer reports a problem with a feature and expresses concern about it not working.
  V2 reasoning : Customer reports a problem with the rule not working as expected.
--------------------------------------------------------------------------------
Email 3 — Email stuck in pending
  V1 sentiment : negative | conf: 0.95
  V2 sentiment : negative | conf: 0.9
  V1 reasoning : Customer reports an issue with an email being stuck in pending after being marked as resolved.

In [1]:
!pip install -q transformers accelerate sentencepiece

In [2]:
import os
import glob
import numpy as np
from sentence_transformers import SentenceTransformer

KB_FOLDER = "/kaggle/input/kb-article"  # change if your folder is different

# 1) Load KB articles
kb_docs = []  # list of dicts: {id, title, text}

for path in glob.glob(os.path.join(KB_FOLDER, "*.txt")) + glob.glob(os.path.join(KB_FOLDER, "*.md")):
    with open(path, "r", encoding="utf-8") as f:
        text = f.read().strip()
    title = os.path.splitext(os.path.basename(path))[0]
    kb_docs.append({"id": path, "title": title, "text": text})

print(f"Loaded {len(kb_docs)} KB articles")
for d in kb_docs:
    print("-", d["title"])


Loaded 6 KB articles
- performance_issues
- automations_setup
- csat_troubleshooting
- analytics_dashboard_overview
- workflows_overview
- tagging_overview


In [3]:
# 2) Load an open-source embedding model
embed_model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# 3) Compute embeddings for all KB docs (normalize for cosine similarity)
kb_texts = [d["text"] for d in kb_docs]
kb_embeddings = embed_model.encode(kb_texts, normalize_embeddings=True)
kb_embeddings = np.array(kb_embeddings)
kb_embeddings.shape


(6, 384)
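
Because the embeddings are L2-normalized, a plain dot product between a query vector and the KB matrix gives cosine similarity directly. The sketch below checks that on toy 3-dimensional vectors standing in for the 384-dimensional ones; `l2_normalize` and the toy arrays are illustrative names and values, not part of the notebook.

```python
import numpy as np

# Toy 3-dimensional vectors standing in for real 384-dimensional embeddings.
docs = np.array([
    [3.0, 4.0, 0.0],
    [0.0, 0.0, 2.0],
    [1.0, 1.0, 0.0],
])
query = np.array([6.0, 8.0, 0.0])

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

docs_n = l2_normalize(docs)
query_n = l2_normalize(query)

# For unit vectors, the dot product equals the cosine similarity.
sims = docs_n @ query_n

# Reference value: cosine similarity computed from the unnormalized vectors.
reference = (docs @ query) / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))

# Indices of the two most similar documents, best first.
top_k = np.argsort(-sims)[:2]
print(sims.round(3), top_k)
```

Normalizing once at indexing time keeps each query down to a single matrix-vector product.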

In [4]:
def retrieve_docs(query, top_k=3):
    """
    Returns top_k docs with similarity scores.
    """
    q_emb = embed_model.encode([query], normalize_embeddings=True)[0]  # shape (dim,)
    sims = kb_embeddings @ q_emb  # cosine similarity since normalized
    idxs = np.argsort(-sims)[:top_k]
    
    results = []
    for idx in idxs:
        doc = kb_docs[idx]
        score = float(sims[idx])
        results.append({"title": doc["title"], "id": doc["id"], "text": doc["text"], "score": score})
    return results

def similarity_to_confidence(sim):
    """
    Convert a cosine similarity in [-1, 1] into a rough confidence in [0, 1].
    Piecewise-linear heuristic: similarities at or below 0.3 map to 0.0,
    similarities at or above 0.8 map to 1.0, and values in between scale linearly.
    """
    # clamp sim to [0,1]
    sim = max(0.0, min(1.0, sim))
    # map [0.3, 0.8] -> [0, 1]; below 0.3 -> 0, above 0.8 -> 1
    if sim <= 0.3:
        return 0.0
    if sim >= 0.8:
        return 1.0
    return (sim - 0.3) / (0.8 - 0.3)
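
To make the mapping concrete, here is a minimal vectorized sketch of the same piecewise-linear heuristic, applied to a few top-document similarity values of the kind this notebook produces. The name `similarity_to_confidence_batch` and the `low`/`high` parameters are illustrative and not part of the notebook.

```python
import numpy as np

def similarity_to_confidence_batch(sims, low=0.3, high=0.8):
    """Piecewise-linear mapping: <= low gives 0, >= high gives 1, linear between."""
    sims = np.asarray(sims, dtype=float)
    return np.clip((sims - low) / (high - low), 0.0, 1.0)

scores = [0.839, 0.699, 0.203, -0.1]
conf = similarity_to_confidence_batch(scores)
for s, c in zip(scores, conf):
    print(f"similarity={s:+.3f} -> confidence={c:.2f}")
```

A similarity of 0.699 gives (0.699 - 0.3) / 0.5 = 0.798, which rounds to 0.80, while anything at or above 0.8 saturates at 1.00.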


In [5]:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

GEN_MODEL_NAME = "mistralai/Mistral-7B-Instruct-v0.3"  # or a smaller instruct model

gen_tokenizer = AutoTokenizer.from_pretrained(GEN_MODEL_NAME)

gen_model = AutoModelForCausalLM.from_pretrained(
    GEN_MODEL_NAME,
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32,
    device_map="auto"
)

def generate_answer_from_context(query, retrieved_docs, max_new_tokens=256):
    """
    Use KB context (top docs) + query to generate an answer.
    """
    context_chunks = []
    for d in retrieved_docs:
        context_chunks.append(f"### Article: {d['title']}\n{d['text']}")
    context = "\n\n---\n\n".join(context_chunks)

    prompt = f"""
You are a support assistant for Hiver.

You have access to the following knowledge base articles:

{context}

Using ONLY this information, answer the user's question.

If the answer is not clearly covered, say that you are not fully sure and suggest next steps.

Question: {query}

Answer in 3–6 sentences, concise and clear:
""".strip()

    inputs = gen_tokenizer(prompt, return_tensors="pt").to(gen_model.device)
    with torch.no_grad():
        output_ids = gen_model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            pad_token_id=gen_tokenizer.eos_token_id
        )
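    # Decode only the newly generated tokens. Comparing the decoded text against
    # the prompt string is fragile, because decoding does not always reproduce
    # the prompt character for character.
    prompt_len = inputs["input_ids"].shape[1]
    new_tokens = output_ids[0][prompt_len:]
    return gen_tokenizer.decode(new_tokens, skip_special_tokens=True).strip()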
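
The prompt above concatenates the full text of every retrieved article, which is fine for six short KB files but could overflow the model's context window with longer ones. One possible guard, sketched here under the assumption that a per-article character budget is an acceptable stand-in for a token budget, is to truncate each article before joining. `build_context` and `max_chars_per_doc` are hypothetical names, not part of the notebook.

```python
def build_context(retrieved_docs, max_chars_per_doc=1500):
    """Join retrieved articles into one context string, truncating long ones."""
    chunks = []
    for d in retrieved_docs:
        text = d["text"]
        if len(text) > max_chars_per_doc:
            # Cut at the character budget and mark the truncation explicitly.
            text = text[:max_chars_per_doc].rstrip() + " [truncated]"
        chunks.append(f"### Article: {d['title']}\n{text}")
    return "\n\n---\n\n".join(chunks)

# Toy documents for illustration only.
sample_docs = [
    {"title": "short_article", "text": "Enable the feature in settings."},
    {"title": "long_article", "text": "x" * 5000},
]
context = build_context(sample_docs, max_chars_per_doc=100)
print(len(context))
```

Counting tokens with the generation tokenizer would be more precise than counting characters, at the cost of an extra tokenization pass per article.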



In [6]:
def rag_query(query, top_k=3, generate=True):
    # 1. Retrieve
    retrieved = retrieve_docs(query, top_k=top_k)
    if not retrieved:
        return {
            "query": query,
            "retrieved": [],
            "answer": "No relevant articles found.",
            "confidence": 0.0,
        }
    
    # 2. Overall confidence = confidence from top similarity
    top_sim = retrieved[0]["score"]
    confidence = similarity_to_confidence(top_sim)
    
    # 3. Generate answer from context
    if generate:
        answer = generate_answer_from_context(query, retrieved)
    else:
        # Simple extractive variant
        answer = "Possible answer from KB:\n\n" + retrieved[0]["text"][:600] + "..."
    
    return {
        "query": query,
        "retrieved": retrieved,
        "answer": answer,
        "confidence": confidence,
    }
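
The control flow of `rag_query` can be exercised without loading the embedding or generation models by passing the components in as functions. The sketch below mirrors the three steps (retrieve, score, answer) with stand-in callables. `rag_query_with`, `fake_retrieve`, and `linear_confidence` are hypothetical helpers and the document text is placeholder data.

```python
def rag_query_with(query, retrieve_fn, confidence_fn, generate_fn=None, top_k=3):
    """Same control flow as rag_query, with the components passed in as arguments."""
    retrieved = retrieve_fn(query, top_k)
    if not retrieved:
        return {"query": query, "retrieved": [],
                "answer": "No relevant articles found.", "confidence": 0.0}

    # Confidence is derived from the best-matching document only.
    confidence = confidence_fn(retrieved[0]["score"])

    if generate_fn is not None:
        answer = generate_fn(query, retrieved)
    else:
        # Extractive fallback: quote the start of the top article.
        answer = "Possible answer from KB:\n\n" + retrieved[0]["text"][:600] + "..."

    return {"query": query, "retrieved": retrieved,
            "answer": answer, "confidence": confidence}

# Stand-in components (placeholder data, for illustration only).
def fake_retrieve(query, top_k):
    docs = [{"title": "example_article", "id": "example",
             "text": "Example article text.", "score": 0.55}]
    return docs[:top_k]

def linear_confidence(sim):
    return max(0.0, min(1.0, (sim - 0.3) / 0.5))

result = rag_query_with("example question", fake_retrieve, linear_confidence)
empty = rag_query_with("example question", lambda q, k: [], linear_confidence)
print(result["confidence"], empty["answer"])
```

Checking the extractive path and the empty-retrieval path this way is quick, since neither needs the 7B model in memory.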


In [7]:
q1 = "How do I configure automations in Hiver?"
res1 = rag_query(q1, top_k=3, generate=True)

print("Query:", res1["query"])
print("\nRetrieved articles:")
for d in res1["retrieved"]:
    print(f"- {d['title']} (score={d['score']:.3f})")

print("\nGenerated answer:")
print(res1["answer"])

print(f"\nConfidence score: {res1['confidence']:.2f}")


Query: How do I configure automations in Hiver?

Retrieved articles:
- automations_setup (score=0.839)
- workflows_overview (score=0.610)
- performance_issues (score=0.350)

Generated answer:
To configure automations in Hiver, follow these steps:
1. Go to the Hiver Dashboard.
2. Navigate to Automations.
3. Click "Create Automation".
4. Select a trigger, define conditions, and add actions.
5. Save and enable the automation.

For more detailed instructions, please refer to the "automations_setup" article in our knowledge base.

Confidence score: 1.00


In [8]:
q2 = "Why is CSAT not appearing?"
res2 = rag_query(q2, top_k=3, generate=True)

print("Query:", res2["query"])
print("\nRetrieved articles:")
for d in res2["retrieved"]:
    print(f"- {d['title']} (score={d['score']:.3f})")

print("\nGenerated answer:")
print(res2["answer"])

print(f"\nConfidence score: {res2['confidence']:.2f}")


Query: Why is CSAT not appearing?

Retrieved articles:
- csat_troubleshooting (score=0.699)
- analytics_dashboard_overview (score=0.203)
- performance_issues (score=0.127)

Generated answer:
CSAT may not appear if the shared mailbox's CSAT surveys are not enabled, the conversation was not marked as "Resolved", the email does not include the CSAT snippet, or the CSAT dashboard is filtered. To troubleshoot, ensure CSAT surveys are enabled, check that the conversation was marked as "Resolved", verify that the email includes the CSAT snippet, and confirm that the CSAT dashboard is not filtered. If CSAT stopped appearing suddenly, check recent changes to email templates, updates to the CSAT configuration, and whether Gmail threading removed the CSAT snippet.

Confidence score: 0.80
