# AIG230 NLP (Week 3 Lab) — Notebook 1: Text Representation

This notebook focuses on **turning raw text into numeric features** you can use in real-world ML systems.

You will build:
- a clean **train/test split**
- **Bag-of-Words** (binary and count)
- **Document-Term Matrix** (DTM)
- **TF-IDF** (with n-grams)
- **Hashing trick** (production-friendly)
- basic **retrieval** (cosine similarity) and a **baseline classifier**
- model **persistence** (save/load)

## 0) Setup


In [2]:

import re
import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer, HashingVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import joblib


## 1) A small, realistic dataset (you can replace with your own CSV)


In industry, text often comes with:
- an **ID**
- free-text **description**
- a **label** (category, priority, intent, topic) or a target (churn, fraud, etc.)

Here we create a toy dataset that looks like support tickets / ops incidents.  
Swap this section with a `pd.read_csv(...)` in your own workflows.
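
A minimal sketch of that swap, assuming a CSV with the same three columns as the toy dataset. The inline `csv_text`, the `df_csv_demo` name, and the `required` check are illustrative stand-ins, not part of the lab; in practice you would pass a file path to `pd.read_csv`.

```python
import io

import pandas as pd

# Stand-in for a real file; replace io.StringIO(csv_text) with a path.
csv_text = """ticket_id,text,label
T-901,VPN tunnel drops after sleep,network
T-902,Cannot reset password from portal,auth
T-903,,device
"""

df_csv_demo = pd.read_csv(io.StringIO(csv_text))

# Fail early if an expected column is missing.
required = {"ticket_id", "text", "label"}
missing = required - set(df_csv_demo.columns)
if missing:
    raise ValueError(f"CSV is missing columns: {sorted(missing)}")

# Drop rows without text or label; they cannot be vectorized or trained on.
df_csv_demo = df_csv_demo.dropna(subset=["text", "label"]).reset_index(drop=True)
print(df_csv_demo)
```

The row with an empty `text` field is read as missing and dropped, so only two tickets remain.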


In [3]:

data = [
    ("T-001", "VPN keeps disconnecting every 10 minutes on Windows 11 after latest update", "network"),
    ("T-002", "Password reset link is expired and user cannot login to the portal", "auth"),
    ("T-003", "Email delivery delayed, outbound messages queued for hours", "messaging"),
    ("T-004", "Cannot install printer driver, installer fails with error code 1603", "device"),
    ("T-005", "MFA prompt never arrives on mobile app, user stuck at login", "auth"),
    ("T-006", "WiFi signal drops in meeting rooms, access point reboot helps temporarily", "network"),
    ("T-007", "Outlook search not returning results, index seems corrupted", "messaging"),
    ("T-008", "Laptop battery drains fast after BIOS update, power settings unchanged", "device"),
    ("T-009", "Portal shows 500 error when submitting form, happened after deployment", "app"),
    ("T-010", "API requests timing out, latency spike observed in last hour", "app"),
    ("T-011", "User cannot access shared drive, permission denied though in correct group", "auth"),
    ("T-012", "Teams calls have choppy audio, jitter high on corporate network", "network"),
    ("T-013", "Push notifications not working on Android for the app", "app"),
    ("T-014", "Mailbox is full and cannot receive emails, auto-archive not running", "messaging"),
    ("T-015", "Bluetooth mouse not pairing after restart, device shows as unknown", "device"),
]

df = pd.DataFrame(data, columns=["ticket_id", "text", "label"])
df


Unnamed: 0,ticket_id,text,label
0,T-001,VPN keeps disconnecting every 10 minutes on Wi...,network
1,T-002,Password reset link is expired and user cannot...,auth
2,T-003,"Email delivery delayed, outbound messages queu...",messaging
3,T-004,"Cannot install printer driver, installer fails...",device
4,T-005,"MFA prompt never arrives on mobile app, user s...",auth
5,T-006,"WiFi signal drops in meeting rooms, access poi...",network
6,T-007,"Outlook search not returning results, index se...",messaging
7,T-008,"Laptop battery drains fast after BIOS update, ...",device
8,T-009,"Portal shows 500 error when submitting form, h...",app
9,T-010,"API requests timing out, latency spike observe...",app


### Train/test split


In [4]:

X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.33, random_state=42, stratify=df["label"]
)

print("Train size:", len(X_train))
print("Test size:", len(X_test))


Train size: 10
Test size: 5


## 2) Tokenization basics and normalization (lightweight, practical)


In production pipelines you typically do **minimal, safe normalization**:
- lowercase
- normalize whitespace
- optionally strip obvious punctuation
- keep numbers when they carry meaning (error codes, versions, dates)

Heavy normalization (stemming, aggressive regexes) can hurt when your text includes:
error codes, product names, IDs, or domain terminology.
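
To make the risk concrete, here is a small sketch comparing light normalization with a hypothetical "heavy" cleaner that keeps letters only. `aggressive_normalize` is an illustrative assumption, not something the lab uses.

```python
import re


def simple_normalize(text: str) -> str:
    # Lowercase and collapse whitespace; numbers and punctuation survive.
    return re.sub(r"\s+", " ", text.lower()).strip()


def aggressive_normalize(text: str) -> str:
    # Hypothetical heavy cleaner: replace everything except letters and spaces.
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()


ticket = "Installer fails with error code 1603 on Windows 11"
light = simple_normalize(ticket)
heavy = aggressive_normalize(ticket)

print(light)  # error code and OS version are still present
print(heavy)  # both numbers are gone
```

After the heavy cleaner, a ticket about error 1603 on Windows 11 is indistinguishable from one about any other error code on any other Windows version.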


In [5]:

def simple_normalize(text: str) -> str:
    text = text.lower()
    text = re.sub(r"\s+", " ", text).strip()
    return text

df["text_norm"] = df["text"].map(simple_normalize)
df[["ticket_id","text_norm","label"]].head()


Unnamed: 0,ticket_id,text_norm,label
0,T-001,vpn keeps disconnecting every 10 minutes on wi...,network
1,T-002,password reset link is expired and user cannot...,auth
2,T-003,"email delivery delayed, outbound messages queu...",messaging
3,T-004,"cannot install printer driver, installer fails...",device
4,T-005,"mfa prompt never arrives on mobile app, user s...",auth


## 3) Vocabulary + Document-Term Matrix (DTM) with CountVectorizer


**CountVectorizer** builds:
- a vocabulary (token → column index)
- a sparse matrix where rows are documents and columns are tokens

This is the classic **Document-Term Matrix** representation.


In [6]:

count_vec = CountVectorizer(
    lowercase=True,
    token_pattern=r"(?u)\b\w+\b",  # also keeps 1-character tokens; the default pattern needs 2+ word characters
    min_df=1
)

X_train_counts = count_vec.fit_transform(X_train)
X_test_counts  = count_vec.transform(X_test)

print("DTM shape (train):", X_train_counts.shape)
print("Vocabulary size:", len(count_vec.vocabulary_))


DTM shape (train): (10, 92)
Vocabulary size: 92


### Inspect the vocabulary and a single row


In [7]:

# Show a small slice of the vocabulary (token -> index)
vocab_items = sorted(count_vec.vocabulary_.items(), key=lambda x: x[1])[:25]
vocab_items


[('10', 0),
 ('11', 1),
 ('1603', 2),
 ('500', 3),
 ('access', 4),
 ('after', 5),
 ('and', 6),
 ('api', 7),
 ('app', 8),
 ('archive', 9),
 ('arrives', 10),
 ('at', 11),
 ('auto', 12),
 ('battery', 13),
 ('bios', 14),
 ('cannot', 15),
 ('code', 16),
 ('correct', 17),
 ('corrupted', 18),
 ('denied', 19),
 ('deployment', 20),
 ('disconnecting', 21),
 ('drains', 22),
 ('drive', 23),
 ('driver', 24)]

In [9]:

# Look at a specific document row: non-zero entries (token counts)
row_id = 0
row = X_train_counts[row_id]
inv_vocab = {idx: tok for tok, idx in count_vec.vocabulary_.items()}

nz_cols = row.nonzero()[1]
tokens_counts = sorted([(inv_vocab[c], int(row[0, c])) for c in nz_cols], key=lambda x: -x[1])
tokens_counts[:20]


[('portal', 1),
 ('shows', 1),
 ('500', 1),
 ('error', 1),
 ('when', 1),
 ('submitting', 1),
 ('form', 1),
 ('happened', 1),
 ('after', 1),
 ('deployment', 1)]

## 4) Binary vs Count-based Bag-of-Words


Binary BoW: token present or not (good for short texts and some classification tasks)  
Count BoW: raw frequency (baseline for many pipelines)

Both discard word order.
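
The difference is easiest to see on a document with a repeated token. This is a minimal sketch on two made-up strings, using the default tokenizer:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["error error error timeout", "timeout"]

count_demo = CountVectorizer().fit_transform(docs).toarray()
binary_demo = CountVectorizer(binary=True).fit_transform(docs).toarray()

print(count_demo)   # raw frequencies
print(binary_demo)  # presence/absence only
```

With `binary=True` the three occurrences of "error" collapse to a single 1, which can help when repetition carries little signal.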


In [10]:
binary_vec = CountVectorizer(binary=True, token_pattern=r"(?u)\b\w+\b")
X_train_bin = binary_vec.fit_transform(X_train)


In [11]:
X_train_bin.shape

(10, 92)

## 5) TF-IDF (a refinement, not a replacement)


TF-IDF downweights very common tokens and upweights tokens that are more distinctive.

In industry, TF-IDF with **n-grams** is a strong baseline for:
- ticket routing
- intent detection
- spam detection
- incident clustering


In [12]:
tfidf_vec = TfidfVectorizer(
    ngram_range = (1,2),
    token_pattern=r"(?u)\b\w+\b",
    min_df=1,
    sublinear_tf=True
)
X_train_tfidf = tfidf_vec.fit_transform(X_train)
X_test_tfidf  = tfidf_vec.transform(X_test)


In [13]:
print("TF-IDF DTM shape (train):", X_train_tfidf.shape)

TF-IDF DTM shape (train): (10, 186)


## 6) Quick retrieval: 'find similar tickets' with cosine similarity


A very common industry use case is **nearest neighbor retrieval** for:
- deduplication
- suggesting knowledge base articles
- finding similar past incidents
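
A minimal sketch of the idea on a made-up three-ticket corpus. Because `TfidfVectorizer` L2-normalizes each row by default, cosine similarity reduces to a plain dot product, which the example checks.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "vpn keeps disconnecting",
    "password reset link expired",
    "wifi signal drops",
]

sim_vec = TfidfVectorizer()
X_corpus = sim_vec.fit_transform(corpus)

# "again" is not in the fitted vocabulary, so it is silently ignored.
q = sim_vec.transform(["vpn disconnecting again"])

sims = cosine_similarity(q, X_corpus).ravel()
dots = (q @ X_corpus.T).toarray().ravel()  # same values, since rows are unit length

best = int(np.argmax(sims))
print(corpus[best], sims)
```

Tickets that share no token with the query score exactly zero, which is the main limitation of purely lexical retrieval.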


In [14]:

X_all = tfidf_vec.transform(df["text"])

def search_similar(query: str, top_k: int = 3):
    query_vec = tfidf_vec.transform([query])
    sims = cosine_similarity(query_vec, X_all).ravel()
    top_idx = np.argsort(sims)[-top_k:][::-1]  # reverse so the highest similarity comes first
    return df.iloc[top_idx][["ticket_id", "text", "label"]].assign(similarity= sims[top_idx])


## 7) Classification baseline (Logistic Regression)


For text classification, a strong baseline is:

**TF-IDF → Linear model (LogReg / Linear SVM)**

This is fast, reliable, easy to explain, and often hard to beat without deep learning.


In [15]:

clf = LogisticRegression(max_iter=2000)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", clf)
])

pipeline.fit(X_train, y_train)
pred = pipeline.predict(X_test)

print(classification_report(y_test, pred))
print("Confusion matrix:\n", confusion_matrix(y_test, pred))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       0.50      1.00      0.67         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1

    accuracy                           0.40         5
   macro avg       0.30      0.40      0.33         5
weighted avg       0.30      0.40      0.33         5

Confusion matrix:
 [[0 1 0 0 0]
 [0 1 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 [0 0 0 0 1]]




## 8) Production pattern: HashingVectorizer (no stored vocab)


In production, you may need:
- constant memory usage
- privacy (no vocabulary inspection)
- streaming support
- easier deployment across services

**HashingVectorizer** avoids building a vocabulary. Tradeoff: collisions.
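
A short sketch of both sides of that tradeoff. The deliberately tiny `n_features=4` is an assumption made only to force collisions; real pipelines use a far larger value, as in the cell below.

```python
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless: transform() works without fit(), and the width is fixed up front.
hv = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)
X_hash = hv.transform(["vpn drops", "mfa prompt never arrives"])
print(X_hash.shape)

# Six distinct tokens hashed into four buckets must collide somewhere.
tiny = HashingVectorizer(n_features=4, alternate_sign=False, norm=None)
X_tiny = tiny.transform(["vpn wifi dns proxy firewall router"])
print(X_tiny.toarray())
```

With `norm=None` and `alternate_sign=False` each bucket holds a raw count, so the row still sums to six tokens but at least one bucket holds two or more of them. There is also no way to map a column back to a token, which is what makes the representation hard to inspect.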


In [16]:
hash_pipe = Pipeline([
    ("hash", HashingVectorizer(
        n_features=2**18,        # tune for your scale
        alternate_sign=False,    # keep hashed values non-negative (no sign flipping)
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b"
    )),
    ("model", LogisticRegression(max_iter=2000))
])

hash_pipe.fit(X_train, y_train)
pred_hash = hash_pipe.predict(X_test)
print(classification_report(y_test, pred_hash))


              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       1.00      1.00      1.00         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1

    accuracy                           0.40         5
   macro avg       0.40      0.40      0.40         5
weighted avg       0.40      0.40      0.40         5





## 9) Save and load the model (typical deployment step)


In [None]:

model_path = "week3_text_representation_model.joblib"
joblib.dump(pipeline, model_path)

loaded = joblib.load(model_path)
loaded.predict(["portal returns 500 error after deploy"])
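
Before shipping a saved model it is worth confirming that the reloaded pipeline behaves the same as the one in memory. The sketch below does that round trip in a temporary directory; the four training texts and the `demo_model.joblib` file name are illustrative assumptions.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["vpn drops", "wifi drops", "password expired", "password reset"]
labels = ["network", "network", "auth", "auth"]

demo_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
]).fit(texts, labels)

queries = ["password link expired", "vpn drops again"]

# Save and reload inside a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_model.joblib")
    joblib.dump(demo_pipe, path)
    restored = joblib.load(path)

same = list(restored.predict(queries)) == list(demo_pipe.predict(queries))
print(same)
```

Because the vectorizer and the classifier are saved together in one pipeline, the loaded object accepts raw strings. joblib files are pickle-based, so load only files you trust, and load them with the same scikit-learn version that wrote them.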

## Exercises (do these during lab)
1) Add 10 more tickets to `data` with realistic wording and labels. Re-train and compare results.  
2) Try `ngram_range=(1,3)` and observe what changes.  
3) For retrieval, test at least 3 queries and explain why the top result makes sense.  
4) Replace the dataset with a CSV you create (columns: `text`, `label`) and rerun the notebook.


In [17]:
# Exercise 1: Add 10 more tickets with realistic wording and labels
new_tickets = [
    ("T-016", "Database connection pool exhausted, application cannot connect to DB", "app"),
    ("T-017", "SSL certificate expired on internal web server, HTTPS fails", "network"),
    ("T-018", "Two-factor authentication code not sent to registered phone number", "auth"),
    ("T-019", "Firewall blocking outbound traffic on port 443, web requests failing", "network"),
    ("T-020", "Application crashes with out of memory error during peak load", "app"),
    ("T-021", "SMTP relay rejected, cannot send external emails from company domain", "messaging"),
    ("T-022", "USB drive not recognized by Windows, shows as unknown device", "device"),
    ("T-023", "Single sign-on fails when accessing third-party application", "auth"),
    ("T-024", "Server disk space at 95 percent, cleanup needed urgently", "device"),
    ("T-025", "Load balancer health check failing, traffic not routing properly", "network"),
]

# Combine with original data
expanded_data = data + new_tickets
df_expanded = pd.DataFrame(expanded_data, columns=["ticket_id", "text", "label"])

print(f"Original dataset size: {len(data)}")
print(f"Expanded dataset size: {len(df_expanded)}")
print(f"\nLabel distribution in expanded dataset:")
print(df_expanded['label'].value_counts())
print(f"\nSample new tickets:")
df_expanded.tail(5)

Original dataset size: 15
Expanded dataset size: 25

Label distribution in expanded dataset:
label
network      6
auth         5
device       5
app          5
messaging    4
Name: count, dtype: int64

Sample new tickets:


Unnamed: 0,ticket_id,text,label
20,T-021,"SMTP relay rejected, cannot send external emai...",messaging
21,T-022,"USB drive not recognized by Windows, shows as ...",device
22,T-023,Single sign-on fails when accessing third-part...,auth
23,T-024,"Server disk space at 95 percent, cleanup neede...",device
24,T-025,"Load balancer health check failing, traffic no...",network


In [23]:
# Retrain with expanded dataset
X_train_exp, X_test_exp, y_train_exp, y_test_exp = train_test_split(
    df_expanded["text"], df_expanded["label"], test_size=0.33, random_state=42, stratify=df_expanded["label"]
)

# Retrain pipeline with same hyperparameters
pipeline_exp = Pipeline([
    ("tfidf", TfidfVectorizer(
        ngram_range=(1,2),
        token_pattern=r"(?u)\b\w+\b",
        sublinear_tf=True
    )),
    ("model", LogisticRegression(max_iter=2000))
])

pipeline_exp.fit(X_train_exp, y_train_exp)
pred_exp = pipeline_exp.predict(X_test_exp)
print(f"\nOriginal dataset - Train/Test: {len(X_train)}/{len(X_test)}")
print(f"Expanded dataset - Train/Test: {len(X_train_exp)}/{len(X_test_exp)}")
print("\n--- Original Model Performance ---")
print(classification_report(y_test, pred))
print("\n--- Expanded Model Performance ---")
print(classification_report(y_test_exp, pred_exp))
print("\nConclusion: more training data usually helps, but test sets of 5 and 9 tickets are too small")
print("to show it; a single prediction moves accuracy by 11 to 20 points.")


Original dataset - Train/Test: 10/5
Expanded dataset - Train/Test: 16/9

--- Original Model Performance ---
              precision    recall  f1-score   support

         app       0.00      0.00      0.00         1
        auth       0.50      1.00      0.67         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       1.00      1.00      1.00         1

    accuracy                           0.40         5
   macro avg       0.30      0.40      0.33         5
weighted avg       0.30      0.40      0.33         5


--- Expanded Model Performance ---
              precision    recall  f1-score   support

         app       0.00      0.00      0.00         2
        auth       0.00      0.00      0.00         2
      device       1.00      0.50      0.67         2
   messaging       0.00      0.00      0.00         1
     network       0.25      1.00      0.40         2

    accuracy                           0



In [24]:
# Exercise 2: Try ngram_range=(1,3) and observe changes

# Original: ngram_range=(1,2)
tfidf_bigram = TfidfVectorizer(
    ngram_range=(1,2),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True
)
X_train_bigram = tfidf_bigram.fit_transform(X_train_exp)

# New: ngram_range=(1,3)
tfidf_trigram = TfidfVectorizer(
    ngram_range=(1,3),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True
)
X_train_trigram = tfidf_trigram.fit_transform(X_train_exp)

print(f"\nVocabulary size with ngram_range=(1,2): {len(tfidf_bigram.vocabulary_):,}")
print(f"Vocabulary size with ngram_range=(1,3): {len(tfidf_trigram.vocabulary_):,}")
print(f"Increase: {len(tfidf_trigram.vocabulary_) - len(tfidf_bigram.vocabulary_):,} features")

# Compare some trigrams
trigram_features = [feat for feat in tfidf_trigram.get_feature_names_out() if len(feat.split()) == 3]
print(f"\nSample trigrams (3-word sequences):")
for tg in trigram_features[:15]:
    print(f"  - {tg}")

# Train models with both settings
pipeline_bigram = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), token_pattern=r"(?u)\b\w+\b", sublinear_tf=True)),
    ("model", LogisticRegression(max_iter=2000))
])

pipeline_trigram = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,3), token_pattern=r"(?u)\b\w+\b", sublinear_tf=True)),
    ("model", LogisticRegression(max_iter=2000))
])

pipeline_bigram.fit(X_train_exp, y_train_exp)
pipeline_trigram.fit(X_train_exp, y_train_exp)

pred_bigram = pipeline_bigram.predict(X_test_exp)
pred_trigram = pipeline_trigram.predict(X_test_exp)

from sklearn.metrics import accuracy_score, f1_score

print(f"\n--- Performance Comparison ---")
print(f"Bigram (1,2) Accuracy:  {accuracy_score(y_test_exp, pred_bigram):.4f}")
print(f"Trigram (1,3) Accuracy: {accuracy_score(y_test_exp, pred_trigram):.4f}")
print(f"\nBigram (1,2) F1 (weighted):  {f1_score(y_test_exp, pred_bigram, average='weighted'):.4f}")
print(f"Trigram (1,3) F1 (weighted): {f1_score(y_test_exp, pred_trigram, average='weighted'):.4f}")

print("\nObservations:")
print("- Trigrams capture longer phrases (e.g., 'access point reboot')")
print("- More features = potential for overfitting on small datasets")
print("- Trigrams help when exact 3-word phrases are discriminative")
print("- May not improve much on short texts or small training sets")


Vocabulary size with ngram_range=(1,2): 276
Vocabulary size with ngram_range=(1,3): 403
Increase: 127 features

Sample trigrams (3-word sequences):
  - 10 minutes on
  - 11 after latest
  - access point reboot
  - access shared drive
  - after bios update
  - after latest update
  - and cannot receive
  - android for the
  - api requests timing
  - app user stuck
  - application crashes with
  - archive not running
  - arrives on mobile
  - as unknown device
  - authentication code not

--- Performance Comparison ---
Bigram (1,2) Accuracy:  0.3333
Trigram (1,3) Accuracy: 0.3333

Bigram (1,2) F1 (weighted):  0.2370
Trigram (1,3) F1 (weighted): 0.2370

Observations:
- Trigrams capture longer phrases (e.g., 'access point reboot')
- More features = potential for overfitting on small datasets
- Trigrams help when exact 3-word phrases are discriminative
- May not improve much on short texts or small training sets


In [25]:
# Exercise 3: Test retrieval with 3 queries and explain results
# Update search function to use expanded dataset
tfidf_retrieval = TfidfVectorizer(
    ngram_range=(1,2),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True
)
X_all_exp = tfidf_retrieval.fit_transform(df_expanded["text"])

def search_similar_exp(query: str, top_k: int = 3):
    query_vec = tfidf_retrieval.transform([query])
    sims = cosine_similarity(query_vec, X_all_exp).ravel()
    top_idx = np.argsort(sims)[-top_k:][::-1]  # Reverse to get highest first
    results = df_expanded.iloc[top_idx][["ticket_id", "text", "label"]].copy()
    results['similarity'] = sims[top_idx]
    return results

# Test Query 1: Authentication/login issues
query1 = "user login failed authentication problem"
print(f"\n--- Query 1: '{query1}' ---")
results1 = search_similar_exp(query1, top_k=3)
print(results1.to_string(index=False))
print("\nExplanation:")
print("Top result likely contains 'login', 'user', 'authentication' or related terms.")
print("TF-IDF gives high weight to these domain-specific keywords.")
print("Cosine similarity finds tickets with overlapping vocabulary.")

# Test Query 2: Network connectivity issues
query2 = "network connection drops frequently"
print(f"\n\n--- Query 2: '{query2}' ---")
results2 = search_similar_exp(query2, top_k=3)
print(results2.to_string(index=False))
print("\nExplanation:")
print("TF-IDF matches on shared tokens only: 'network', 'connection', or 'drops'.")
print("Related terms such as 'VPN', 'WiFi', or 'firewall' add nothing unless the query contains them.")
print("Bigrams boost similarity only when query and ticket share one; none of this query's bigrams appear in the tickets.")

# Test Query 3: Device/hardware issues
query3 = "device driver installation error"
print(f"\n\n--- Query 3: '{query3}' ---")
results3 = search_similar_exp(query3, top_k=3)
print(results3.to_string(index=False))
print("\nExplanation:")
print("Top matches contain 'device', 'driver', 'install', or 'error' keywords.")
print("Exact matches on rare terms like 'driver' or error codes increase similarity.")
print("This demonstrates how TF-IDF-based retrieval works for ticket deduplication.")

print("\n\n--- Summary ---")
print("Cosine similarity with TF-IDF is effective for:")
print("  ✓ Finding similar tickets (deduplication)")
print("  ✓ Suggesting knowledge base articles")
print("  ✓ Routing to similar historical incidents")
print("  ✓ Fast, interpretable, no labeled data required")


--- Query 1: 'user login failed authentication problem' ---
ticket_id                                                               text label  similarity
    T-005        MFA prompt never arrives on mobile app, user stuck at login  auth    0.208839
    T-002 Password reset link is expired and user cannot login to the portal  auth    0.204465
    T-018 Two-factor authentication code not sent to registered phone number  auth    0.151879

Explanation:
Top result likely contains 'login', 'user', 'authentication' or related terms.
TF-IDF gives high weight to these domain-specific keywords.
Cosine similarity finds tickets with overlapping vocabulary.


--- Query 2: 'network connection drops frequently' ---
ticket_id                                                                      text   label  similarity
    T-016      Database connection pool exhausted, application cannot connect to DB     app    0.145624
    T-012           Teams calls have choppy audio, jitter high on corporate netw

In [26]:
# Exercise 4: Create a CSV file and rerun the notebook workflow on it
# Build a small custom dataset to write to disk

csv_data = {
    'text': [
        "Application server crashed with Java heap space error",
        "User account locked after multiple failed login attempts",
        "Network latency spike affecting all cloud services",
        "Email attachment size exceeds maximum allowed limit",
        "Printer queue stuck, jobs not printing to shared printer",
        "VPN tunnel established but no traffic flowing through",
        "Mobile app crashes on startup after latest iOS update",
        "Database transaction timeout during batch processing",
        "Access denied when trying to mount network share drive",
        "API rate limit exceeded, requests being throttled",
        "Bluetooth keyboard disconnects randomly during use",
        "Certificate validation failed for internal web service",
        "Spam filter blocking legitimate business emails",
        "Screen resolution incorrect after docking laptop",
        "Container orchestration node became unresponsive",
        "Password complexity requirements not clearly communicated",
        "File sync service not uploading local changes to cloud",
        "Load balancer not distributing traffic evenly across servers",
        "Webcam not detected in video conferencing application",
        "API gateway returning 502 bad gateway intermittently"
    ],
    'label': [
        'app', 'auth', 'network', 'messaging', 'device',
        'network', 'app', 'app', 'auth', 'app',
        'device', 'network', 'messaging', 'device', 'app',
        'auth', 'app', 'network', 'device', 'app'
    ]
}

df_csv = pd.DataFrame(csv_data)

# Save to CSV
csv_path = "custom_tickets.csv"
df_csv.to_csv(csv_path, index=False)
print(f"Created CSV file: {csv_path}")
print(f"Dataset size: {len(df_csv)} tickets")
print(f"\nLabel distribution:")
print(df_csv['label'].value_counts())

# Load from CSV (simulating real workflow)
df_loaded = pd.read_csv(csv_path)
print(f"\n✓ Successfully loaded {len(df_loaded)} tickets from CSV")
df_loaded.head()

Created CSV file: custom_tickets.csv
Dataset size: 20 tickets

Label distribution:
label
app          7
network      4
device       4
auth         3
messaging    2
Name: count, dtype: int64

✓ Successfully loaded 20 tickets from CSV


Unnamed: 0,text,label
0,Application server crashed with Java heap spac...,app
1,User account locked after multiple failed logi...,auth
2,Network latency spike affecting all cloud serv...,network
3,Email attachment size exceeds maximum allowed ...,messaging
4,"Printer queue stuck, jobs not printing to shar...",device


In [29]:
# Rerun the entire workflow with CSV data
# 1. Train/test split
X_train_csv, X_test_csv, y_train_csv, y_test_csv = train_test_split(
    df_loaded["text"], df_loaded["label"], test_size=0.33, random_state=42, stratify=df_loaded["label"]
)

print(f"Train size: {len(X_train_csv)}, Test size: {len(X_test_csv)}")

# 2. TF-IDF Vectorization
tfidf_csv = TfidfVectorizer(
    ngram_range=(1,2),
    token_pattern=r"(?u)\b\w+\b",
    sublinear_tf=True,
    min_df=1
)

X_train_tfidf_csv = tfidf_csv.fit_transform(X_train_csv)
X_test_tfidf_csv = tfidf_csv.transform(X_test_csv)

print(f"TF-IDF vocabulary size: {len(tfidf_csv.vocabulary_)}")
print(f"Feature matrix shape: {X_train_tfidf_csv.shape}")

# 3. Train classifier
pipeline_csv = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1,2), token_pattern=r"(?u)\b\w+\b", sublinear_tf=True)),
    ("model", LogisticRegression(max_iter=2000, random_state=42))
])

pipeline_csv.fit(X_train_csv, y_train_csv)
pred_csv = pipeline_csv.predict(X_test_csv)

print(f"\n--- Classification Results ---")
print(classification_report(y_test_csv, pred_csv))
print(f"\nConfusion Matrix:")
print(confusion_matrix(y_test_csv, pred_csv))

# 4. Test retrieval on CSV dataset
X_all_csv = tfidf_csv.transform(df_loaded["text"])

def search_csv(query: str, top_k: int = 3):
    query_vec = tfidf_csv.transform([query])
    sims = cosine_similarity(query_vec, X_all_csv).ravel()
    top_idx = np.argsort(sims)[-top_k:][::-1]
    results = df_loaded.iloc[top_idx][["text", "label"]].copy()
    results['similarity'] = sims[top_idx]
    return results

print(f"\n--- Retrieval Test ---")
test_query = "application error memory issue"
print(f"Query: '{test_query}'")
print(search_csv(test_query, top_k=3).to_string(index=False))

# 5. Save model
csv_model_path = "csv_text_model.joblib"
joblib.dump(pipeline_csv, csv_model_path)
print(f"\n✓ Model saved to: {csv_model_path}")

# 6. Load and test
loaded_csv_model = joblib.load(csv_model_path)
sample_prediction = loaded_csv_model.predict(["database connection failed"])
print(f"Model loaded successfully")
print(f"Sample prediction for 'database connection failed': {sample_prediction[0]}")


Train size: 13, Test size: 7
TF-IDF vocabulary size: 175
Feature matrix shape: (13, 175)

--- Classification Results ---
              precision    recall  f1-score   support

         app       0.43      1.00      0.60         3
        auth       0.00      0.00      0.00         1
      device       0.00      0.00      0.00         1
   messaging       0.00      0.00      0.00         1
     network       0.00      0.00      0.00         1

    accuracy                           0.43         7
   macro avg       0.09      0.20      0.12         7
weighted avg       0.18      0.43      0.26         7


Confusion Matrix:
[[3 0 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]
 [1 0 0 0 0]]

--- Retrieval Test ---
Query: 'application error memory issue'
                                                    text  label  similarity
   Application server crashed with Java heap space error    app    0.343898
   Webcam not detected in video conferencing application device    0.160396
User account 

