## <span style="color:#0066cc; font-weight:bold">INFERENCE PIPELINE FOR NEW TICKET CLASSIFICATION AND NEXT STEP RECOMMENDATION</span>

<p style="font-size:18px;">
This notebook orchestrates the full pipeline for a new ticket stored in the PostgreSQL <code>tickets</code> table, identified by its <code>ticket_id</code>.<br> 
The demo uses some of the 128 tickets marked with <strong>demo_flag = TRUE</strong>, which I held out from all preprocessing, clustering and recommendation work.
</p>


### <span style="color:#4da6ff; font-weight:bold">Demo setup</span>

<p style="font-size:16px;">This setup involves cherry-picking at least 3 tickets (one each for EN, SV and FI) from the <code>tickets</code> table where demo_flag = TRUE.
These tickets have not been preprocessed, embedded or previously seen by any downstream model.</p>

In [None]:
# Add project root to path 
import sys
from pathlib import Path
sys.path.append(str(Path("../").resolve()))

In [None]:
# Generic imports
import pandas as pd 


# imports from my project
import src.preprocessing_utils as prep
from src.db_utils import fetch_and_cleanup_tickets
from src.db_utils import fetch_preprocessed_tickets, fetch_already_embedded_ticket_ids, insert_embeddings
from src.embedding_utils import batch_embed_and_store, embed_new_tickets
import src.recommendation_utils as ru
import src.diagnostic_utils as dgn


# Show full column content
pd.set_option('display.max_colwidth', None)

In [None]:
# Specify ticket_id for demo tickets

# English example : "TKT-524587"
# Finnish example: "TKT-537091"
# Swedish example: "TKT-520235"
# Absurd ticket: "TKT-600000"

ticket_ids = ["TKT-524587", "TKT-537091", "TKT-520235", "TKT-600000" ]

In [None]:
# Now let's fetch the relevant records from the table "tickets".

new_tickets = fetch_and_cleanup_tickets(ticket_ids)
new_tickets

### <span style="color:#4da6ff; font-weight:bold">Preprocessing tickets</span>

<p style="font-size:16px;">
In this section we will preprocess and write the results to the table <code>ticket_preprocessed</code>. There are 7 steps altogether:
</p>
<ol style="font-size:16px;">
    <li>Language detection with langdetect</li>
    <li>Text cleaning (removal of emojis and extra spacing)</li>
    <li>Translation to English using <code>Helsinki-NLP/opus-mt-sv-en</code> / <code>Helsinki-NLP/opus-mt-fi-en</code></li>
    <li>PII masking for names, locations, card numbers, telephone numbers, IBANs and email addresses</li>
    <li>Combine cleaned and translated subject and body</li>
    <li>Keyword extraction from the above combination for verification purposes</li>
    <li>Write the preprocessed ticket data to PostgreSQL</li>
</ol>
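
<p style="font-size:16px;">
To make steps 2, 4 and 5 above more concrete, here is a minimal sketch using only the standard library. The regular expressions, the mask tokens and the function names (<code>clean_text</code>, <code>mask_pii</code>, <code>combine</code>) are assumptions made for illustration and are not the actual contents of <code>src.preprocessing_utils</code>. Name and location masking is left out because it needs an NER model rather than patterns, and language detection and translation are omitted as well.
</p>

```python
import re

# Hypothetical patterns for illustration only; the real masking rules in
# src.preprocessing_utils may differ.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
IBAN_RE = re.compile(r"\b[A-Z]{2}\d{2}(?: ?[A-Z0-9]{4}){2,7}(?: ?[A-Z0-9]{1,3})?\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")
PHONE_RE = re.compile(r"(?<!\w)\+?\d[\d\s-]{6,14}\d(?!\w)")

# Order matters: IBANs and card numbers contain digit runs that the looser
# phone pattern would otherwise match first.
MASKS = [
    (EMAIL_RE, "[EMAIL]"),
    (IBAN_RE, "[IBAN]"),
    (CARD_RE, "[CARD]"),
    (PHONE_RE, "[PHONE]"),
]


def clean_text(text: str) -> str:
    """Drop emojis in the covered ranges and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", EMOJI_RE.sub("", text)).strip()


def mask_pii(text: str) -> str:
    """Replace each pattern match with its placeholder token."""
    for pattern, token in MASKS:
        text = pattern.sub(token, text)
    return text


def combine(subject: str, body: str) -> str:
    """Join the subject and body into a single text field."""
    return f"{subject} {body}".strip()


sample_body = (
    "Hi 😀  my card 4111 1111 1111 1111 was charged, mail me at "
    "anna@example.com or call +358 40 1234567. IBAN FI21 1234 5600 0007 85"
)
print(combine("Double charge", mask_pii(clean_text(sample_body))))
```

<p style="font-size:16px;">
Pattern-based masking like this is easy to audit but will miss unusual formats, so it is best treated as a first pass.
</p>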


In [None]:
preprocessed_tickets = prep.preprocess_tickets(new_tickets)

In [None]:
preprocessed_tickets

### <span style="color:#4da6ff; font-weight:bold">Embedding combined_text (subject + body)</span>

<ul style="font-size:16px;">
    <li>Now I embed the <code>combined_text</code> field from the <code>tickets_preprocessed</code> table.</li>
    <li>This field is <code>{translated_subject || masked_body}</code>, which I consider sufficient to categorize issue types.</li>
    <li>The embedding model I use is <code>all-MiniLM-L6-v2</code> from SentenceTransformers, which outputs a 384-dimensional vector.</li>
    <li>This vector is also saved to the <code>ticket_embedding</code> table in PostgreSQL for later steps.</li>
</ul>

<p style="font-size:16px;"><strong>Note:</strong> The embedding model is configurable (change <code>src.configs.LOCAL_EMBEDDING_MODEL</code> and the vector size of the <code>embedding</code> column).</p>


In [None]:
embed_new_tickets()

### <span style="color:#4da6ff; font-weight:bold">Using saved HDBSCAN model to predict a new ticket's category and next step recommendation </span>

<p style="font-size:16px;">
The output dimension of the embeddings (384) is too high for clustering: distances between points become less discriminative in high dimensions, and the clustering itself scales poorly. 
I have used <strong>UMAP (Uniform Manifold Approximation and Projection)</strong> to project the embeddings into a lower-dimensional space before clustering. 
This projector has been saved as a <code>.pkl</code> file so that any new ticket goes through the exact same projection before being categorized by the clustering model.
</p>

<p style="font-size:16px;">
I have implemented a clusterer using <strong>HDBSCAN (Hierarchical Density-Based Spatial Clustering of Applications with Noise)</strong> because:
</p>
<ol style="font-size:16px;">
    <li>The number of clusters, their shapes, and densities are unknown.</li>
    <li>The dataset could be noisy and contain many outliers.</li>
</ol>

<p style="font-size:16px;">
The saved clusterer is loaded in this section and used to categorize new tickets via <code>approximate_predict</code> in HDBSCAN. 
Note that a new ticket with a previously unseen issue category may still be assigned to an existing cluster, but with a <strong><span style="color:red;">very low confidence score. </span></strong>
</p>

<p style="font-size:16px;">
I maintain a <code>cluster_labels</code> table that holds human-readable issue category names. 
A local LLM (<code>mistral:7b</code> via Ollama currently) is prompted to inspect the <code>combined_text</code> of the top-N (10 currently) tickets closest to each cluster center and produce a suitable natural-language label. 
</p>

<p style="font-size:16px;">
After the issue categorization, the recommender program will:
</p>
<ol style="font-size:16px;">
    <li>Find the closest ticket in the same cluster.</li>
    <li>Prefer resolved/closed tickets; otherwise use the nearest open ticket.</li>
    <li>Compare <code>internal_comments</code> via an LLM (<code>mistral:7b</code> via Ollama currently) to suggest the next recommended step. 
    The suggestion draws on the neighbour's history when the neighbour confidence is high enough (I am using > 0.8) and falls back to general advice otherwise.</li>
</ol>
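
<p style="font-size:16px;">
The selection logic in the steps above could look roughly like the sketch below. It is not the code in <code>src.recommendation_utils</code>: the <code>Neighbour</code> class, the function names, the ticket IDs, the status strings and the use of cosine similarity as the neighbour confidence are all assumptions made for illustration. Only the 0.8 threshold comes from the description above.
</p>

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

RESOLVED_STATUSES = {"resolved", "closed"}  # assumed status values
NEIGHBOUR_THRESHOLD = 0.8                   # threshold mentioned in the text


@dataclass
class Neighbour:
    """Hypothetical record for a ticket in the same cluster."""
    ticket_id: str
    status: str
    embedding: Sequence[float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_neighbour(
    query: Sequence[float], candidates: List[Neighbour]
) -> Tuple[Optional[Neighbour], float]:
    """Return the most similar resolved/closed ticket, else the most similar open one."""
    if not candidates:
        return None, 0.0
    scored = [(cosine_similarity(query, c.embedding), c) for c in candidates]
    resolved = [s for s in scored if s[1].status.lower() in RESOLVED_STATUSES]
    similarity, best = max(resolved or scored, key=lambda s: s[0])
    return best, similarity


def advice_source(similarity: float) -> str:
    """Decide whether the LLM prompt should lean on the neighbour or stay general."""
    return "neighbour" if similarity > NEIGHBOUR_THRESHOLD else "general"


# Made-up 2D embeddings for three tickets in the same cluster
cluster_members = [
    Neighbour("T-A", "open", [1.0, 0.05]),
    Neighbour("T-B", "closed", [0.9, 0.3]),
    Neighbour("T-C", "resolved", [0.2, 1.0]),
]
best, sim = pick_neighbour([1.0, 0.0], cluster_members)
print(best.ticket_id, round(sim, 2), advice_source(sim))
```

<p style="font-size:16px;">
In this toy example the open ticket is the closest overall, but the closed ticket is chosen because resolved/closed tickets take priority.
</p>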

<p style="font-size:16px;"> <strong>Note:</strong> The local model can be configured via <code>src.ai_utils</code>.</p>


In [None]:
# Store results
results = []

# Get cluster id, label, and suggested next step for each ticket
for tid in ticket_ids:
    (
        suggestion,
        c_confidence,
        n_confidence,
        cluster_id,
        cluster_label,
        best_comments,
        nearest_ticket_id,
        nearest_ticket_status,
        internal_comments
    ) = ru.find_recommendation(tid)


    results.append({
        "ticket_id": tid,
        "Cluster ID": cluster_id,
        "Cluster Label": cluster_label,
        "Cluster confidence": round(c_confidence, 2),
        "Neighbour confidence": round(n_confidence, 2),
        "Nearest Ticket": nearest_ticket_id,
        "Nearest Ticket Status": nearest_ticket_status,
        "Neighbouring Action Sequence": best_comments,
        "Suggested Next Step": suggestion,
        "Internal Comments": internal_comments
    })

# Convert to DataFrame
df_results = pd.DataFrame(results)

# Join with preprocessed_tickets to add combined_text 
if 'ticket_id' not in preprocessed_tickets.columns:
    raise KeyError("`preprocessed_tickets` must contain a 'ticket_id' column for joining.")

# Select only needed columns to avoid duplication or bloat
df_joined = df_results.merge(
    preprocessed_tickets[["ticket_id", "combined_text"]],
    on="ticket_id",
    how="left"
)

# Define desired column order
column_order = [
    "ticket_id",
    "Cluster ID",
    "Cluster Label",
    "combined_text",
    "Cluster confidence",
    "Nearest Ticket",
    "Nearest Ticket Status",
    "Neighbouring Action Sequence",
    "Internal Comments",
    "Suggested Next Step",
    "Neighbour confidence"
]

# Reorder columns
df_joined = df_joined[column_order]

# Display final DataFrame
display(df_joined)





### <span style="color:#4da6ff; font-weight:bold">Results validation</span>

<div style="font-size:16px;">
    <p>
        This section demonstrates some diagnostic tools that help explain why the results look the way they do.
    </p>
    <ul>
        <li><strong>Diagnostic step 1:</strong> Fetches the <code>subject_translated</code> and <code>body_translated</code> for a given ticket and its closest neighbour for manual comparison and verification by a domain expert.</li>
        <li><strong>Diagnostic step 2:</strong> Reduces the embedding dimension to 2D and visualises all clusters, highlighting a given ticket. </li>
        <li><strong>Diagnostic step 3:</strong> Reduces the embedding dimension to 2D and visualises a given cluster, highlighting a given ticket and its neighbour.</li>
    </ul>
</div>
<p style="font-size:16px;"> <strong>Note:</strong> You can add more diagnostic functions as required to <code>src.diagnostic_utils</code>.</p>



In [None]:
# Provide the focus for diagnostics from the above results here. 
ticket_id = "TKT-520235"
neighbour_id = "TKT-543369"
cluster_id = 22

<div style="font-size:16px;"><strong>Diagnostic step 1:</strong></div>

In [None]:
# Fetch subject_translated and body_translated for a given ticket and its closest neighbour
dgn_df = dgn.fetch_ticket_pair_details(ticket_id, neighbour_id, cluster_id)
dgn_df

<div style="font-size:16px;"><strong>Diagnostic step 2:</strong></div>

In [None]:
dgn.plot_all_clusters(ticket_id)

<div style="font-size:16px;"><strong>Diagnostic step 3:</strong></div>

In [None]:
dgn.plot_cluster_projection(ticket_id, neighbour_id, cluster_id)