## <span style="color:#0066cc; font-weight:bold">INFERENCE PIPELINE FOR NEW TICKET CLASSIFICATION AND NEXT STEP RECOMMENDATION</span>

<p style="font-size:18px;">
This notebook orchestrates the full inference pipeline for a new ticket stored in the PostgreSQL <code>tickets</code> table, identified by its <code>ticket_id</code>.<br>
The demo uses some of the 128 tickets marked with <strong>demo_flag = TRUE</strong>, which I held out from all preprocessing, clustering, and recommendation work.
</p>


### <span style="color:#4da6ff; font-weight:bold">Demo setup</span>

<p style="font-size:16px;">This setup involves cherry-picking at least 3 tickets (one each for EN, SV, and FI) from the <code>tickets</code> table where demo_flag = TRUE.
These tickets have not been preprocessed, embedded, or previously seen by any downstream model.</p>

In [1]:
# Add project root to path 
import sys
from pathlib import Path
sys.path.append(str(Path("../").resolve()))

In [2]:
# Generic imports
import pandas as pd 


# Imports from my project
import src.preprocessing_utils as prep
import src.recommendation_utils as ru
from src.db_utils import (
    fetch_and_cleanup_tickets,
    fetch_preprocessed_tickets,
    fetch_already_embedded_ticket_ids,
    insert_embeddings,
)
from src.embedding_utils import batch_embed_and_store, embed_new_tickets


# Show full column content
pd.set_option('display.max_colwidth', None)

In [3]:
# Specify ticket_id for demo tickets

# English example: "TKT-524587"
# Finnish example: "TKT-537091"
# Swedish example: "TKT-520235"

ticket_ids = ["TKT-524587", "TKT-537091", "TKT-520235"]

In [4]:
# Now let's fetch the relevant records from the table "tickets".

new_tickets = fetch_and_cleanup_tickets(ticket_ids)
#new_tickets

### <span style="color:#4da6ff; font-weight:bold">Preprocessing tickets</span>

<p style="font-size:16px;">
In this section we will preprocess and write the results to the table <code>ticket_preprocessed</code>. There are 7 steps altogether:
</p>
<ol style="font-size:16px;">
    <li>Language detection with langdetect</li>
    <li>Text cleaning (removal of emojis and extra spacing)</li>
    <li>Translation to English using <code>Helsinki-NLP/opus-mt-sv-en</code> / <code>Helsinki-NLP/opus-mt-fi-en</code></li>
    <li>PII masking for names, locations, card numbers, telephone numbers, IBAN and email</li>
    <li>Combine cleaned and translated subject and body</li>
    <li>Keyword extraction from the above combination for verification purposes</li>
    <li>Write the preprocessed ticket data to PostgreSQL</li>
</ol>
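
<p style="font-size:16px;">
To make steps 2 and 4 more concrete, here is a minimal sketch of regex-based cleaning and masking. It is an illustration only, not the code in <code>src.preprocessing_utils</code>: the function names, placeholder tokens, and patterns are my assumptions, and names and locations are left out because they would need an NER model rather than regular expressions.
</p>

```python
import re

# Illustrative patterns only (assumptions, not the project's real rules).
# Order matters: IBANs and card numbers are masked before the looser phone pattern.
PII_PATTERNS = [
    ("[EMAIL]", re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")),
    ("[IBAN]", re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")),
    ("[CARD]", re.compile(r"\b(?:\d[ -]?){12,18}\d\b")),
    ("[PHONE]", re.compile(r"\+?\d[\d\s-]{6,14}\d")),
]


def clean_text(text):
    """Drop characters in common emoji ranges and collapse runs of whitespace."""
    text = re.sub(r"[\U0001F000-\U0001FAFF\u2600-\u27BF]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def mask_pii(text):
    """Replace each pattern match with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

<p style="font-size:16px;">
Calling <code>mask_pii(clean_text(body))</code> on a ticket body would then yield text that is safe to pass on to translation-independent steps such as embedding.
</p>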


In [5]:

preprocessed_tickets = prep.preprocess_tickets(new_tickets)


Cleaning & detecting language: 100%|██████████████| 3/3 [00:00<00:00, 14.79it/s]


[INFO] Translated batch 1 (1 texts) for language 'sv'
[INFO] Translated batch 1 (1 texts) for language 'sv'
[INFO] Translated 1 tickets from 'sv' to English
[INFO] Translated batch 1 (1 texts) for language 'fi'
[INFO] Translated batch 1 (1 texts) for language 'fi'
[INFO] Translated 1 tickets from 'fi' to English


Masking PII: 100%|███████████████████████████████| 3/3 [00:00<00:00, 326.41it/s]
Masking PII: 100%|███████████████████████████████| 3/3 [00:00<00:00, 130.32it/s]
Extracting keywords: 100%|██████████████████████| 3/3 [00:00<00:00, 10.91text/s]

[INFO] Keyword extraction completed for 3 texts
[INFO] DataFrame written to table 'ticket_preprocessed' successfully.
Preprocessed rows written to "ticket_preprocessed" table.





In [6]:
#preprocessed_tickets

### <span style="color:#4da6ff; font-weight:bold">Embedding combined_text (subject + body)</span>

<ul style="font-size:16px;">
    <li>Next, I embed the <code>combined_text</code> field from the <code>ticket_preprocessed</code> table.</li>
    <li>This field concatenates the subject and body (<code>{translated_subject || masked_body}</code>), which I consider sufficient for categorizing issue types.</li>
    <li>The embedding model is <code>all-MiniLM-L6-v2</code> from SentenceTransformers, which outputs a 384-dimensional vector.</li>
    <li>The vector is also saved to the <code>ticket_embeddings</code> table in PostgreSQL for later steps.</li>
</ul>

<p style="font-size:16px;"><strong>Note:</strong> The embedding model is configurable (change <code>src.configs.LOCAL_EMBEDDING_MODEL</code> and the vector size of the <code>embedding</code> column).</p>
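
<p style="font-size:16px;">
As a rough sketch of what this step might look like: skip tickets that already have an embedding, then encode the rest in batches. <code>batches</code> and <code>embed_pending</code> are hypothetical names for illustration, not the functions in <code>src.embedding_utils</code>, and the database write is left out.
</p>

```python
def batches(items, batch_size=128):
    """Yield (start_index, chunk) pairs of at most batch_size items."""
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]


def embed_pending(texts_by_id, already_embedded,
                  model_name="all-MiniLM-L6-v2", batch_size=128):
    """Return {ticket_id: vector} for tickets that have no embedding yet.

    texts_by_id: dict of ticket_id -> combined_text (assumed input shape).
    already_embedded: set of ticket_ids to skip.
    """
    # Third-party dependency, imported lazily so `batches` works without it.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    pending = [(tid, text) for tid, text in texts_by_id.items()
               if tid not in already_embedded]

    vectors_by_id = {}
    for _, chunk in batches(pending, batch_size):
        vectors = model.encode([text for _, text in chunk])
        for (tid, _), vector in zip(chunk, vectors):
            vectors_by_id[tid] = vector.tolist()
    return vectors_by_id
```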


In [7]:
embed_new_tickets()

[INFO] 3 tickets left to embed
Embedding 3 tickets (batch size 128)


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Inserted embeddings for batch 0-2
Embeddings for 3 tickets added to "ticket_embeddings" table.


### <span style="color:#4da6ff; font-weight:bold">Using saved HDBSCAN model to predict a new ticket's category and next step recommendation </span>

<p style="font-size:16px;">
At 384 dimensions, the embeddings are too high-dimensional to cluster directly: distance measures become less meaningful and the clustering itself scales poorly. 
I have used <strong>UMAP (Uniform Manifold Approximation and Projection)</strong> to project the embeddings into a lower-dimensional space before clustering. 
This projector has been saved as a <code>.pkl</code> file so that every new ticket goes through exactly the same projection before being categorized by the clustering model.
</p>

<p style="font-size:16px;">
I have implemented a clusterer using <strong>HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)</strong> because:
</p>
<ol style="font-size:16px;">
    <li>The number of clusters, their shapes, and densities are unknown.</li>
    <li>The dataset could be noisy and contain many outliers.</li>
</ol>

<p style="font-size:16px;">
The saved clusterer is loaded in this section and used to categorize new tickets with HDBSCAN's <code>approximate_predict</code>. 
Note that a new ticket with a previously unseen issue category will typically still be assigned to an existing cluster, but with a <strong><span style="color:red;">very low confidence score. </span></strong>
</p>
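
<p style="font-size:16px;">
The prediction step could be sketched as follows. <code>hdbscan.approximate_predict</code> returns a label and a membership strength per point, the label can be <code>-1</code> when a point is treated as noise, and the clusterer must have been fitted with <code>prediction_data=True</code>. The function names, file paths, and the 0.5 threshold are my assumptions for illustration, not values from <code>src.recommendation_utils</code>.
</p>

```python
import pickle

NOISE_LABEL = -1  # label HDBSCAN assigns to points it treats as noise


def predict_cluster(embedding, reducer_path, clusterer_path):
    """Project one embedding with the saved UMAP model, then score it with HDBSCAN."""
    # Third-party dependencies, imported lazily so the helper below has none.
    import hdbscan
    import numpy as np

    with open(reducer_path, "rb") as fh:
        reducer = pickle.load(fh)
    with open(clusterer_path, "rb") as fh:
        clusterer = pickle.load(fh)  # must be fitted with prediction_data=True

    point = np.asarray(embedding, dtype="float32").reshape(1, -1)
    reduced = reducer.transform(point)  # same projector as used for training
    labels, strengths = hdbscan.approximate_predict(clusterer, reduced)
    return int(labels[0]), float(strengths[0])


def interpret_prediction(cluster_id, confidence, min_confidence=0.5):
    """Flag predictions that a human should double-check (threshold is hypothetical)."""
    if cluster_id == NOISE_LABEL:
        return "noise"
    if confidence < min_confidence:
        return "low_confidence"
    return "ok"
```

<p style="font-size:16px;">
With a threshold like this, a prediction with confidence 0.43 would be flagged for review, while one with confidence 1.0 would pass.
</p>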

<p style="font-size:16px;">
I maintain a <code>cluster_labels</code> table that holds human-readable issue category names. 
A local LLM (currently <code>mistral:7b</code> via Ollama) is prompted with the <code>combined_text</code> of the top-N (currently 10) tickets closest to each cluster center and asked to propose a suitable natural-language label. 
</p>
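
<p style="font-size:16px;">
A minimal sketch of the labelling input, assuming the cluster center is the mean vector and distance is Euclidean (both are assumptions; the project may define them differently). The function names and the prompt wording are placeholders, and the actual call to the LLM is omitted.
</p>

```python
import math


def top_n_closest(points, n=10):
    """Return the ids of the n points nearest to the mean vector.

    points: non-empty list of (ticket_id, vector) pairs for one cluster.
    """
    dim = len(points[0][1])
    centroid = [sum(vec[i] for _, vec in points) / len(points) for i in range(dim)]
    ranked = sorted(points, key=lambda item: math.dist(item[1], centroid))
    return [ticket_id for ticket_id, _ in ranked[:n]]


def build_label_prompt(texts):
    """Assemble a labelling prompt; the wording here is a placeholder."""
    bullet_list = "\n".join(f"- {text}" for text in texts)
    return (
        "These support tickets belong to one cluster:\n"
        f"{bullet_list}\n"
        "Reply with a short issue category name (at most five words)."
    )
```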

<p style="font-size:16px;">
After the issue categorization, the recommender program will:
</p>
<ol style="font-size:16px;">
    <li>Find the closest ticket in the same cluster.</li>
    <li>Prefer resolved/closed tickets; otherwise use the nearest open ticket.</li>
    <li>Compare <code>internal_comments</code> via an LLM (<code>mistral:7b</code> via Ollama currently) to suggest the next recommended step. 
    This could be knowledge from the neighbor or general advice depending on the neighbor's status.</li>
</ol>

<p style="font-size:16px;"> <strong>Note:</strong> The local model can be configured via <code>src.ai_utils</code>.</p>
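
<p style="font-size:16px;">
The neighbour selection in steps 1 and 2 above could look roughly like this. It is a sketch under assumptions: Euclidean distance, candidates already restricted to the predicted cluster, and a hypothetical <code>pick_neighbour</code> helper that is not part of <code>src.recommendation_utils</code>.
</p>

```python
import math

DONE_STATUSES = {"resolved", "closed"}


def pick_neighbour(query_vec, candidates):
    """Pick the reference ticket for a recommendation.

    candidates: list of dicts with 'ticket_id', 'status', and 'vector',
    all from the same cluster as the query ticket.
    Returns the nearest resolved/closed ticket; if none exists, the nearest
    ticket of any other status; None if the cluster has no candidates.
    """
    if not candidates:
        return None

    def distance(candidate):
        return math.dist(query_vec, candidate["vector"])

    finished = [c for c in candidates if c["status"] in DONE_STATUSES]
    return min(finished or candidates, key=distance)
```

<p style="font-size:16px;">
Preferring finished tickets even when an open one is closer means the LLM sees comments from a case that actually reached a resolution.
</p>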


In [8]:
# Store results
results = []

# Get cluster id and label with suggested next step
for tid in ticket_ids:
    (
        suggestion,
        confidence,
        cluster_id,
        cluster_label,
        best_comments,
        nearest_ticket_id,
        nearest_ticket_status,
    ) = ru.find_recommendation(tid)
    results.append({
        "Ticket ID": tid,
        "Cluster ID": cluster_id,
        "Cluster Label": cluster_label,
        "Confidence": round(confidence, 2),
        "Nearest Ticket": nearest_ticket_id,
        "Nearest ticket status": nearest_ticket_status,
        "Neighbouring action sequence": best_comments,
        "Suggested Next Step": suggestion,
    })

# Convert to DataFrame
df_results = pd.DataFrame(results)




Table "tickets" updated with TKT-524587's cluster_id, cluster_label, suggestion and confidence.
Table "tickets" updated with TKT-537091's cluster_id, cluster_label, suggestion and confidence.
Table "tickets" updated with TKT-520235's cluster_id, cluster_label, suggestion and confidence.


In [9]:
df_results

| | Ticket ID | Cluster ID | Cluster Label | Confidence | Nearest Ticket | Nearest ticket status | Neighbouring action sequence | Suggested Next Step |
|---|---|---|---|---|---|---|---|---|
| 0 | TKT-524587 | 33 | Urgent Card Issue | 1.0 | TKT-545894 | closed | - Replied to customer; requested last 4 of PAN and timestamp. - Escalated to TechOps; possible regional issue. - Checked auth logs; decline code 05 for last two attempts. - Replied to customer; requested last 4 of PAN and timestamp. | Escalate the issue to TechOps, as there seems to be a possible regional problem based on the decline code 05 in the auth logs. |
| 1 | TKT-537091 | 20 | Increase API Quota Inquiry | 1.0 | TKT-536844 | closed | - Eskalointi TechOpsille; mahdollinen alueellinen vika. - Ei häiriötä status-sivulla; seuranta käynnissä. | Escalate the issue to TechOps for potential local problem diagnosis and resolution. |
| 2 | TKT-520235 | 22 | Tokenization/Apple Pay Issues | 0.43 | TKT-543369 | resolved | - Eskalering till TechOps; möjligt regionalt fel. - Svarade kunden; begärde sista 4 av kort och tidsstämpel. | Escalate the issue to TechOps, requesting a regional check and providing the last 4 digits of the affected device along with timestamps. |
